[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition stage retry
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-47520:
-----------------------------------
    Summary: Precision issues with sum of floats/doubles leads to incorrect data after repartition stage retry  (was: Precision issues with sum of floats/doubles leads to incorrect data after repartition)

> Precision issues with sum of floats/doubles leads to incorrect data after repartition stage retry
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 3.3.2, 3.5.0
> Reporter: William Montaz
> Priority: Major
> Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 was closed as 'Not a Problem' because it is purely about float and double precision, it can still have a drastic impact when combined with spark.sql.execution.sortBeforeRepartition set to true (the default).
> We consistently reproduced the issue with a GROUP BY using a SUM of floats or doubles, followed by a repartition (a common pattern to produce bigger output files, triggered either by SQL hints or by extensions like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few tasks, Spark recomputes the partitions of the previous stage whose output could not be fetched and retries only the failed partitions downstream.
> Because block fetch order is non-deterministic, the recomputed upstream partition can produce a slightly different value for a float/double sum aggregation. In all of our attempts we observed a 1-bit difference in the UnsafeRow backing byte array. The sort performed before repartition uses UnsafeRow.hashCode for the row prefix, which is completely different even with such a 1-bit difference, so the sort order in the recomputed upstream partition is completely different and the target downstream partitions of the shuffled rows change as well.
> Because the sort becomes non-deterministic and only the failed downstream tasks are retried, the resulting repartition produces duplicate rows as well as missing rows. The guarantee brought by SPARK-23207 is broken.
> So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail instead of producing incorrect data. The current Spark default leads to a silent correctness issue.
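A minimal, Spark-free illustration of the root cause described above (the values are chosen by the editor for demonstration, not taken from the report): floating-point addition is not associative, so re-aggregating the same rows in a different block fetch order can yield a slightly different double.

{code:scala}
// Summing the same values in two different orders -- standing in for two
// different shuffle block fetch orders -- gives two different doubles.
val values = Array(1e16, 1.0, -1e16, 3.0)

val sumFetchOrderA = values.foldLeft(0.0)(_ + _)          // one fetch order
val sumFetchOrderB = values.reverse.foldLeft(0.0)(_ + _)  // another fetch order

println(sumFetchOrderA == sumFetchOrderB)  // false: same rows, different aggregate
{code}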
[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-47520: --- Description: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double precision, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target downstream partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail instead of producing incorrect data. The default for spark currently leads to silent correctness issue. was: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double precision, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target downstream partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. 
So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail.
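To see why a single flipped bit is enough to send rows to different downstream partitions, consider the hash used for the sort prefix described above. The sketch below uses Scala's MurmurHash3 rather than UnsafeRow's own hashCode implementation, so it is only an analogy, but it shows the avalanche effect: byte arrays differing in one bit hash to unrelated values.

{code:scala}
import scala.util.hashing.MurmurHash3

// Two 8-byte payloads differing in exactly one bit, mimicking the 1-bit
// difference observed between the original and the recomputed aggregated row.
val original   = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0)
val recomputed = original.clone()
recomputed(7) = 1  // flip the lowest bit of the last byte

// Unrelated hashes => unrelated sort prefixes => a completely different sort
// order, and therefore different target partitions for the same logical rows.
println(MurmurHash3.bytesHash(original, 42))
println(MurmurHash3.bytesHash(recomputed, 42))
{code}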
[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-47520: --- Description: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double precision, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target downstream partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. was: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double rounding, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target downstream partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. 
[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-47520:
-----------------------------------
    Summary: Precision issues with sum of floats/doubles leads to incorrect data after repartition  (was: Rounding issues with sum of floats/doubles leads to incorrect data after repartition)
[jira] [Updated] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-47520: --- Description: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double rounding, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target downstream partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. was: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double rounding, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. 
[jira] [Updated] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition
[ https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-47520: --- Description: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double rounding, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggregation, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new upstream partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new upstream partition and thus target partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. was: We discovered an important correctness issue directly linked to SPARK-47024 Even if SPARK-47024 has been considered 'Not a Problem' since it is linked directly to floats and double rounding, it can still have drastic impacts combined to spark.sql.execution.sortBeforeRepartition set to true (the default) We consistently reproduced the issue doing a GROUP BY with a SUM of float or double aggreagtion, followed by a repartition (common case to produce bigger files as output, either triggered by SQL hints or extensions like kyuubi). If the repartition stage fails with Fetch Failed Exception for only few tasks, spark decides to recompute the partitions from the previous stage for which output could not be fetched and will retry only the failed partitions downstream. Because block fetch order is indeterministic, the new before-shuffle partition computation can provide a slightly different value for a float/double sum aggregation. We noticed a 1 bit difference in all of our attempts. The sort performed before repartition uses UnsafeRow.hashcode for the row prefix which will be completely different even with such 1 bit difference, leading to the sort being completely different in the new before-shuffle partition and thus destination partition for the shuffled rows completely different as well. Because sort becomes undeterministic and since only the failed dowstream tasks are retried the resulting repartition will lead to duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken. So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail. 
[jira] [Created] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition
William Montaz created SPARK-47520:
-----------------------------------
Summary: Rounding issues with sum of floats/doubles leads to incorrect data after repartition
Key: SPARK-47520
URL: https://issues.apache.org/jira/browse/SPARK-47520
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0, 3.3.2, 3.4.2
Reporter: William Montaz

We discovered an important correctness issue directly linked to SPARK-47024.
Even though SPARK-47024 was closed as 'Not a Problem' because it is purely about float and double rounding, it can still have a drastic impact when combined with spark.sql.execution.sortBeforeRepartition set to true (the default).
We consistently reproduced the issue with a GROUP BY using a SUM of floats or doubles, followed by a repartition (a common pattern to produce bigger output files, triggered either by SQL hints or by extensions like Kyuubi).
If the repartition stage fails with a FetchFailedException for only a few tasks, Spark recomputes the partitions of the previous stage whose output could not be fetched and retries only the failed partitions downstream.
Because block fetch order is non-deterministic, the recomputed before-shuffle partition can produce a slightly different value for a float/double sum aggregation. We noticed a 1-bit difference in all of our attempts. The sort performed before repartition uses UnsafeRow.hashCode for the row prefix, which is completely different even with such a 1-bit difference, so the sort order in the new before-shuffle partition is completely different and the destination partitions of the shuffled rows change as well.
Because the sort becomes non-deterministic and only the failed downstream tasks are retried, the resulting repartition produces duplicate rows as well as missing rows. The solution brought by SPARK-23207 is broken.
So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make the entire job fail.
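A sketch of the workaround mentioned above, assuming the internal SQL conf spark.sql.execution.sortBeforeRepartition can be set like any other session configuration (it can equally be passed with --conf on spark-submit):

{code:scala}
import org.apache.spark.sql.SparkSession

// Disable the sort performed before repartition. Per the report above, with
// this disabled a fetch failure during the repartition stage makes the whole
// job fail instead of silently producing duplicate or missing rows.
val spark = SparkSession.builder()
  .appName("sortBeforeRepartition-workaround")
  .config("spark.sql.execution.sortBeforeRepartition", "false")
  .getOrCreate()
{code}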
[jira] [Commented] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested
[ https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679788#comment-16679788 ]

William Montaz commented on SPARK-25975:
-----------------------------------------
Associated pull request: https://github.com/apache/spark/pull/22981

> Spark History does not necessarily display the incomplete applications when requested
> --------------------------------------------------------------------------------------
>
> Key: SPARK-25975
> URL: https://issues.apache.org/jira/browse/SPARK-25975
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.2
> Reporter: William Montaz
> Priority: Minor
> Attachments: fix.patch
>
> Filtering of incomplete applications is done in JavaScript against the response returned by the API. The problem is that if the returned result is not big enough (because of spark.history.ui.maxApplications), it might not contain the incomplete applications.
> To fix this, we can call the API with status RUNNING or COMPLETED depending on the view.
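A hedged sketch of the proposed API-side filtering, assuming a History Server reachable on the default localhost:18080 and the standard /api/v1/applications endpoint with its status parameter:

{code:scala}
import scala.io.Source

// Filter server-side instead of in the page's JavaScript, so the incomplete
// view is not limited by spark.history.ui.maxApplications truncating a mixed list.
val base = "http://localhost:18080/api/v1/applications"  // assumed host and port

val runningJson   = Source.fromURL(s"$base?status=running").mkString    // incomplete apps
val completedJson = Source.fromURL(s"$base?status=completed").mkString  // completed apps

println(runningJson)
{code}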
[jira] [Updated] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested
[ https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-25975:
-----------------------------------
    Attachment: fix.patch
[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679772#comment-16679772 ]

William Montaz commented on SPARK-25973:
-----------------------------------------
OK, created https://github.com/apache/spark/pull/22980

> Spark History Main page performance improvement
> ------------------------------------------------
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.2
> Reporter: William Montaz
> Priority: Minor
> Attachments: fix.patch
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is displaying incomplete or complete applications) to check whether it must display the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the iterator. This way we stop iterating at the first occurrence found.
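A small sketch of the suggested change, using hypothetical names rather than the actual HistoryPage code: exists short-circuits at the first matching application instead of counting all of them just to compare the count with zero.

{code:scala}
// Hypothetical stand-in for the listing entries handled by HistoryPage.scala.
case class AppSummary(id: String, completed: Boolean)

def hasAppsToDisplay(apps: Iterable[AppSummary], requestedIncomplete: Boolean): Boolean = {
  // Before (conceptually): apps.count(app => app.completed != requestedIncomplete) > 0
  // After: stop iterating at the first matching application.
  apps.iterator.exists(app => app.completed != requestedIncomplete)
}

// Example: the incomplete-applications view has something to show here.
println(hasAppsToDisplay(Seq(AppSummary("app-1", completed = true),
                             AppSummary("app-2", completed = false)),
                         requestedIncomplete = true))  // true
{code}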
[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679853#comment-16679853 ]

William Montaz commented on SPARK-25973:
-----------------------------------------
New pull request on master branch: https://github.com/apache/spark/pull/22982
[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-25973:
-----------------------------------
    Priority: Minor  (was: Major)
[jira] [Created] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested
William Montaz created SPARK-25975:
-----------------------------------
Summary: Spark History does not necessarily display the incomplete applications when requested
Key: SPARK-25975
URL: https://issues.apache.org/jira/browse/SPARK-25975
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz
Attachments: fix.patch

Filtering of incomplete applications is done in JavaScript against the response returned by the API. The problem is that if the returned result is not big enough (because of spark.history.ui.maxApplications), it might not contain the incomplete applications.
To fix this, we can call the API with status RUNNING or COMPLETED depending on the view.
[jira] [Updated] (SPARK-25973) Spark History Main page performance improvment
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-25973:
-----------------------------------
    Attachment: fix.patch
[jira] [Created] (SPARK-25973) Spark History Main page performance improvment
William Montaz created SPARK-25973:
-----------------------------------
Summary: Spark History Main page performance improvment
Key: SPARK-25973
URL: https://issues.apache.org/jira/browse/SPARK-25973
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz

HistoryPage.scala counts applications (with a predicate depending on whether it is displaying incomplete or complete applications) to check whether it must display the dataTable.
Since it only checks if allAppsSize > 0, we could use the exists method on the iterator. This way we stop iterating at the first occurrence found.
[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-25973:
-----------------------------------
    Summary: Spark History Main page performance improvement  (was: Spark History Main page performance improvment)
[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement
[ https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-25973:
-----------------------------------
    Description:
HistoryPage.scala counts applications (with a predicate depending on if it is displaying incomplete or complete applications) to check if it must display the dataTable. Since it only checks if allAppsSize > 0, we could use exists method on the iterator. This way we stop iterating at the first occurrence found.

  was:
HistoryPage.scala counts applications (with a predicate depending on if it is displaying incomplete or complete applications) to check if it must display the dataTable. Since it only checks if allAppsSize > 0, we could use exists method on the iterator> This way we stop iterating at the first occurrence found.
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-24150:
-----------------------------------
    Description:
There exists a race condition in the checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field.
The problem is that threads will eventually synchronise on different monitors (because they synchronise on the different objects whose references have been assigned to "applications"), breaking the original synchronisation intent. This is even more likely to occur when number_new_log_files > replayExecutor_pool_size.
If such a log disappears (it is no longer present in the list "applications"), it becomes impossible to read it from the UI (being in the list "applications" is a mandatory check to avoid getting a 404).
Workaround:
* use a permanent object as a monitor on which to synchronise (or synchronise on `this`)
* keep the volatile field for all other read accesses

  was:
There exists a race condition in the checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field.
The problem is that threads will eventually synchronise on different monitors (because they synchronise on the different objects whose references have been assigned to "applications"), breaking the original synchronisation intent. This is even more likely to occur when number_new_log_files > replayExecutor_pool_size.
Workaround:
* use a permanent object as a monitor on which to synchronise (or synchronise on `this`)
* keep the volatile field for all other read accesses
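The monitor-swapping hazard and the proposed workaround can be sketched outside of FsHistoryProvider; the names below are hypothetical, not the actual Spark code:

{code:scala}
// Anti-pattern described above: threads synchronise on a @volatile field that
// they also reassign. After a reassignment, two threads can hold locks on two
// different list objects, so their "critical sections" overlap freely.
object BrokenListing {
  @volatile private var applications: List[String] = Nil

  def addApp(app: String): Unit = applications.synchronized {
    applications = app :: applications  // swaps the monitor under other threads
  }
}

// Workaround from the report: a permanent, never-reassigned monitor (or `this`),
// keeping the volatile field only for lock-free readers.
object FixedListing {
  private val lock = new Object
  @volatile private var applications: List[String] = Nil

  def addApp(app: String): Unit = lock.synchronized {
    applications = app :: applications
  }

  def snapshot: List[String] = applications  // volatile read, no lock needed
}
{code}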
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-24150: --- Description: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that threads will eventually synchronise on different monitors (because they will synchronise on different objects which references have been assigned to "applications"), breaking the initial synchronisation intent. This has even greater chance to reproduce when number_new_log_files > replayExecutor_pool_size Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses was: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that threads will eventually synchronise on different monitors (because they will synchronise on different objects which references that have been assigned to "applications"), breaking the initial synchronisation intent. This has even greater chance to reproduce when number_new_log_files > replayExecutor_pool_size Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses > Race condition in FsHistoryProvider > --- > > Key: SPARK-24150 > URL: https://issues.apache.org/jira/browse/SPARK-24150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: William Montaz >Priority: Major > > There exist a race condition in checkLogs method between threads of > replayExecutor. They use the field "applications" to synchronise, but they > also update that field. > The problem is that threads will eventually synchronise on different monitors > (because they will synchronise on different objects which references have > been assigned to "applications"), breaking the initial synchronisation > intent. This has even greater chance to reproduce when number_new_log_files > > replayExecutor_pool_size > Workaround: > * use a permanent object as a monitor on which to synchronise (or > synchronise on `this`) > * keep volatile field for all other read accesses -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-24150:
-----------------------------------
    Priority: Major  (was: Minor)
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-24150: --- Description: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that threads will eventually synchronise on different monitors (because they will synchronise on different objects which references that have been assigned to "applications"), breaking the initial synchronisation intent. This has even greater chance to reproduce when number_new_log_files > replayExecutor_pool_size Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses was: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that if the number of tasks (the number of new log files to replay and add to the applications list) is greater than the number of threads in the pool, threads will eventually synchronise on different monitors (because they will synchronise on different objects which references that have been assigned to "applications"), breaking the initial synchronisation intent. Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses > Race condition in FsHistoryProvider > --- > > Key: SPARK-24150 > URL: https://issues.apache.org/jira/browse/SPARK-24150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: William Montaz >Priority: Minor > > There exist a race condition in checkLogs method between threads of > replayExecutor. They use the field "applications" to synchronise, but they > also update that field. > The problem is that threads will eventually synchronise on different monitors > (because they will synchronise on different objects which references that > have been assigned to "applications"), breaking the initial synchronisation > intent. This has even greater chance to reproduce when number_new_log_files > > replayExecutor_pool_size > Workaround: > * use a permanent object as a monitor on which to synchronise (or > synchronise on `this`) > * keep volatile field for all other read accesses -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-24150: --- Description: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that if the number of tasks (the number of new log files to replay and add to the applications list) is greater than the number of threads in the pool, threads will eventually synchronise on different monitors (because they will synchronise on different objects which references that have been assigned to "applications"), breaking the initial synchronisation intent. Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses was: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that if the number of tasks (the number of new log files to replay and add to the applications list) is greater than the number of threads in the pool, there is a great chance that a thread will try to synchronise on an updated version of applications (since it is volatile and updated) while some are still being synchronised on an old reference of applications. There the race condition happens. Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses > Race condition in FsHistoryProvider > --- > > Key: SPARK-24150 > URL: https://issues.apache.org/jira/browse/SPARK-24150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: William Montaz >Priority: Minor > > There exist a race condition in checkLogs method between threads of > replayExecutor. They use the field "applications" to synchronise, but they > also update that field. > The problem is that if the number of tasks (the number of new log files to > replay and add to the applications list) is greater than the number of > threads in the pool, threads will eventually synchronise on different > monitors (because they will synchronise on different objects which references > that have been assigned to "applications"), breaking the initial > synchronisation intent. > Workaround: > * use a permanent object as a monitor on which to synchronise (or > synchronise on `this`) > * keep volatile field for all other read accesses -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Montaz updated SPARK-24150: --- Description: There exist a race condition in checkLogs method between threads of replayExecutor. They use the field "applications" to synchronise, but they also update that field. The problem is that if the number of tasks (the number of new log files to replay and add to the applications list) is greater than the number of threads in the pool, there is a great chance that a thread will try to synchronise on an updated version of applications (since it is volatile and updated) while some are still being synchronised on an old reference of applications. There the race condition happens. Workaround: * use a permanent object as a monitor on which to synchronise (or synchronise on `this`) * keep volatile field for all other read accesses was: There exist a race condition between the method checkLogs and cleanLogs. cleanLogs can read the field applications while it is concurrently processed by checkLogs. It is possible that checkLogs added new fetched logs, sets applications and this is erased by cleanLogs having an old version of applications. The problem is that the fetched log won't appear in applications anymore and it will then be impossible to display the corresponding application in the History Server, since it must be in the LinkedList applications. Workaround: * use a permanent object as a monitor on which to synchronise * keep volatile field for all other read accesses > Race condition in FsHistoryProvider > --- > > Key: SPARK-24150 > URL: https://issues.apache.org/jira/browse/SPARK-24150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: William Montaz >Priority: Minor > > There exist a race condition in checkLogs method between threads of > replayExecutor. They use the field "applications" to synchronise, but they > also update that field. > The problem is that if the number of tasks (the number of new log files to > replay and add to the applications list) is greater than the number of > threads in the pool, there is a great chance that a thread will try to > synchronise on an updated version of applications (since it is volatile and > updated) while some are still being synchronised on an old reference of > applications. There the race condition happens. > Workaround: > * use a permanent object as a monitor on which to synchronise (or > synchronise on `this`) > * keep volatile field for all other read accesses -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Montaz updated SPARK-24150:
-----------------------------------
    Description:
There exists a race condition between the methods checkLogs and cleanLogs. cleanLogs can read the field applications while it is concurrently processed by checkLogs. It is possible that checkLogs adds newly fetched logs and sets applications, and this is then erased by cleanLogs holding an old version of applications. The problem is that the fetched log no longer appears in applications, and it then becomes impossible to display the corresponding application in the History Server, since it must be in the LinkedList applications.
Workaround:
* use a permanent object as a monitor on which to synchronise
* keep the volatile field for all other read accesses

  was:
There exists a race condition between the methods checkLogs and cleanLogs.
Workaround:
* use a permanent object as a monitor on which to synchronise
* keep the volatile field for all other read accesses
[jira] [Created] (SPARK-24150) Race condition in FsHistoryProvider
William Montaz created SPARK-24150:
-----------------------------------
Summary: Race condition in FsHistoryProvider
Key: SPARK-24150
URL: https://issues.apache.org/jira/browse/SPARK-24150
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: William Montaz

There exists a race condition between the methods checkLogs and cleanLogs.
Workaround:
* use a permanent object as a monitor on which to synchronise
* keep the volatile field for all other read accesses