[ 
https://issues.apache.org/jira/browse/SPARK-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6188:
-----------------------------
    Shepherd:   (was: Josh Rosen)
    Assignee: Theodore Vasiloudis

> Instance types can be mislabeled when re-starting cluster with default 
> arguments
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-6188
>                 URL: https://issues.apache.org/jira/browse/SPARK-6188
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>            Priority: Minor
>             Fix For: 1.4.0
>
>
> This was discovered when investigating 
> https://issues.apache.org/jira/browse/SPARK-5838.
> In short, when restarting a cluster that was launched with a non-default 
> instance type, you have to provide the instance type(s) again in the 
> "./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>" 
> command. Otherwise the instance type is set to the default, m1.large.
> This then affects the setup of the machines.
> I'll submit a pull request that takes care of this, without the user needing 
> to provide the instance type(s) again.
> EDIT: 
> Example case where this becomes a problem:
> 1. The user launches a cluster with instances that have 1 disk, e.g. m3.large.
> 2. The user stops the cluster.
> 3. When the user restarts the cluster with the start command without 
> providing the instance type, the setup is performed using the default 
> instance type, m1.large, which assumes 2 disks present in the machine.
> 4. SPARK_LOCAL_DIRS is then set to "/mnt/spark,/mnt2/spark". /mnt2 
> corresponds to the snapshot partition in an m3.large instance, which is only 
> 8GB in size. When the user runs jobs that shuffle data, this partition fills 
> up quickly, resulting in failed jobs due to "No space left on device" errors.
> Apart from this example, one could come up with other cases where the setup 
> of the machines is wrong because they are assumed to be of type m1.large.
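
The failure mode described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the actual spark-ec2 code
or the eventual patch: the helper names (resolve_instance_type, local_dirs) and
the NUM_DISKS table are hypothetical, the disk counts are the ones given in the
description (m1.large: 2, m3.large: 1), and the absolute /mnt paths are assumed.
The idea is to take the instance type from the instances that already exist and
fall back to the default only when it cannot be determined.

```python
# Hypothetical table: disk counts as stated in the issue description.
NUM_DISKS = {"m1.large": 2, "m3.large": 1}

# Default instance type named in the issue description.
DEFAULT_INSTANCE_TYPE = "m1.large"


def resolve_instance_type(existing_types, fallback=DEFAULT_INSTANCE_TYPE):
    """Return the type shared by the existing instances, else the fallback.

    existing_types: iterable of instance-type strings read from the
    stopped/running instances of the cluster (how they are read is not shown).
    """
    types = set(existing_types)
    if len(types) == 1:
        return types.pop()
    # No instances found, or mixed types: the type cannot be inferred.
    return fallback


def local_dirs(instance_type):
    """Build a SPARK_LOCAL_DIRS value with one directory per disk."""
    num_disks = NUM_DISKS[instance_type]
    dirs = ["/mnt/spark"]
    # Additional disks are assumed to be mounted at /mnt2, /mnt3, ...
    dirs += ["/mnt%d/spark" % i for i in range(2, num_disks + 1)]
    return ",".join(dirs)


if __name__ == "__main__":
    # Restart without an explicit instance type: assuming the default.
    print(local_dirs(DEFAULT_INSTANCE_TYPE))
    # Restart using the type of the existing m3.large instances.
    print(local_dirs(resolve_instance_type(["m3.large", "m3.large"])))
```

With the default type, the sketch yields two local directories, including the
one under /mnt2; with the type taken from the existing m3.large instances, it
yields only the directory on the single disk.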



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

