[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158320#comment-16158320 ]

Saisai Shao commented on SPARK-21942:
-

Personally I would prefer to fail fast when something like this happens. Here 
it happened to be the root folder that got cleaned, and using {{mkdirs}} can 
handle that case; but if some persisted block or shuffle index file is removed 
(which the OS can do, since the file is closed), I think there's no way to 
handle it. So instead of trying to work around it, exposing an exception to 
the user might be more useful, and would let the user know about the issue 
earlier.
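
To illustrate the trade-off, a minimal sketch (not Spark's actual code; 
{{getFileLenient}} and {{getFileStrict}} are made-up names):

{code:scala}
import java.io.File
import java.io.IOException

// Option 1: silently recover. mkdirs() recreates a deleted blockmgr-XXX
// root along the way, where a plain mkdir() fails if the parent is gone.
def getFileLenient(root: File, subDir: String, name: String): File = {
  val dir = new File(root, subDir)
  if (!dir.isDirectory && !dir.mkdirs()) {
    throw new IOException(s"Failed to create local dir in $dir")
  }
  new File(dir, name)
}

// Option 2: fail fast, surfacing the externally deleted root immediately
// instead of quietly rebuilding it and hiding the data loss.
def getFileStrict(root: File, subDir: String, name: String): File = {
  if (!root.isDirectory) {
    throw new IllegalStateException(s"Scratch root $root was removed externally")
  }
  getFileLenient(root, subDir, name)
}
{code}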

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
> Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of "scratch" local folders, which can be 
> configured via the _spark.local.dir_ option and which default to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is derived from a hash, to spread files evenly.
> The function _DiskBlockManager.getFile_ expects the top-level directories 
> (_blockmgr-XXX..._) to always exist (they are created once, when the Spark 
> context is first created); otherwise it fails with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, the OS may have any of 
> several strategies for automatically removing files from it:
> * at boot time
> * on a regular basis (e.g. once per day, via a system cron job)
> * based on file age
> The symptom is that after the process (in our case, a service) using Spark 
> has been running for a while (a few days), it may no longer be able to load 
> files, since the top-level scratch directories are gone and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually: _/tmp_ is the default in the Spark config, and the system both has 
> the right to tamper with its contents and, after some period of time, is 
> highly likely to do so.
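
For context, the lookup described above amounts to roughly the following (a 
sketch only; {{subDirsPerLocalDir}} and {{subDirFor}} approximate 
_DiskBlockManager_'s internals rather than quote them):

{code:scala}
import java.io.File

// Assumed default for the number of YY sub-folders per local dir.
val subDirsPerLocalDir = 64

// Pick the YY sub-folder from a non-negative hash of the file name,
// so files spread evenly across the sub-folders.
def subDirFor(localDir: File, filename: String): File = {
  val hash = filename.hashCode & Int.MaxValue
  new File(localDir, "%02x".format(hash % subDirsPerLocalDir))
}

// getFile effectively resolves new File(subDirFor(localDir, name), name);
// if the blockmgr-XXX parent was deleted externally, creating the YY
// sub-folder fails with the IOException quoted above.
{code}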






[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Ruslan Shestopalyuk (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158304#comment-16158304 ]

Ruslan Shestopalyuk commented on SPARK-21942:
-

[~jerryshao] I believe the only objective reason here would be to make the 
Spark code more robust.

Regarding the rest - I agree it's not a valid issue, since if a problem like 
this happens, one can always spend some time debugging the Spark code and work 
out a workaround.

Also, hopefully this very page gets indexed by the search engines, so maybe 
even that won't be needed :)








[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158271#comment-16158271 ]

Saisai Shao commented on SPARK-21942:
-

{quote}
https://github.com/search?utf8=%E2%9C%93&q=filename%3Aspark-defaults.conf++NOT+spark.local.dir&type=Code

shows 2000+ repos that omit the `spark.local.dir` setting altogether, which 
means they are using `/tmp`, even though it's not a good default choice.
Which of course does not prove anything, since those are not necessarily 
"production environments".
{quote}

[~rshest] you can always find reasons, but I don't think this is a valid 
issue.







[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-07 Thread Ruslan Shestopalyuk (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157529#comment-16157529 ]

Ruslan Shestopalyuk commented on SPARK-21942:
-

I can't comment on how common it is in general, but here's one very likely 
scenario:
* an online service uses a Spark model to score requests
* it loads the model on startup
* the model gets retrained periodically, let's say every two weeks, and the 
service picks up the new one

But yes, it's true that for this to happen, the same Spark context has to stay 
alive for a relatively long time, so it may not be that common at all.
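
For what it's worth, the simplest way to sidestep the tmp cleaner entirely is 
to move the scratch space out of _/tmp_; an illustrative _spark-defaults.conf_ 
entry (the path is an example, not a recommendation):

{code}
# Point Spark's scratch space at a directory that no tmp-cleaner sweeps.
spark.local.dir /var/lib/spark/scratch
{code}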







[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-07 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157503#comment-16157503 ]

Sean Owen commented on SPARK-21942:
---

It's possible that these files aren't touched for a very long time yet are 
still in use, I suppose (existence proof here). But is that common? It seems 
pretty exceptional. Maybe I'm missing why it's at all common.







[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-07 Thread Ruslan Shestopalyuk (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157489#comment-16157489 ]

Ruslan Shestopalyuk commented on SPARK-21942:
-

Sean, the OS won't ever delete files that are currently open - in fact, it 
only deletes files that have not been _accessed_ for several days.

For example, on a RHEL/Fedora-based distribution (which is the base for the 
standard AWS Linux image), the corresponding cron job config looks like this:

{code:bash}
$ cat /etc/cron.daily/tmpwatch

#! /bin/sh
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
    -x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
    -X '/tmp/hsperfdata_*' 10d /tmp
/usr/sbin/tmpwatch "$flags" 30d /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch "$flags" -f 30d "$d"
    fi
done
{code}

So it runs the cron task daily, executing the 
[tmpwatch|https://linux.die.net/man/8/tmpwatch] utility and telling it, in 
particular:
* for _/tmp_, to delete all files that have not been _accessed_ for more than 
10 days
* the same for _/var/tmp_, but with a 30-day threshold

So in the case of the Spark scratch folder, it will be purged if it has not 
been accessed (written to or read from) for more than 10 days; a possible 
mitigation is sketched below.
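
One possible mitigation on such systems (my assumption, mirroring the stock 
script above rather than a documented recipe) is to exclude Spark's scratch 
directories from the daily sweep:

{code:bash}
# Same daily tmpwatch invocation, with Spark's /tmp/blockmgr-* (and
# /tmp/spark-*) directories excluded via the -X pattern option.
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
    -x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
    -X '/tmp/hsperfdata_*' -X '/tmp/blockmgr-*' -X '/tmp/spark-*' 10d /tmp
{code}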







[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-07 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157041#comment-16157041 ]

Sean Owen commented on SPARK-21942:
---

I suppose my real doubt is whether this would actually resolve the problem. 
mkdir -> mkdirs isn't a big change, but if an OS process is deleting files 
that Spark is still using, I don't know that there's any full 'fix' for that. 
I don't really object to the change; I just don't think it really makes this 
situation work.



