Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Gourav Sengupta
Hi Nicolas,

Thanks a ton for your kind response; I will surely try this out.

Regards,
Gourav Sengupta

On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris 
wrote:

> as a workaround turn off pruning :
>
> spark.sql.hive.metastorePartitionPruning false
> spark.sql.hive.convertMetastoreParquet false
>
> see
> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
>
> On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> > Hi,
> >
> > I received a response from AWS, this is an issue with EMR, and they are
> > working on resolving the issue I believe.
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
> > gourav.sengupta.develo...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > the query still gives the same error if we write "SELECT * FROM
> table_name
> > > WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> > >
> > > Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
> > >
> > >
> > > Thanks and Regards,
> > > Gourav Sengupta
> > >
> > > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
> > >
> > >> Date handling was tightened up in Spark 3. I think you need to
> compare to
> > >> a date literal, not a string literal.
> > >>
> > >> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> > >> gourav.sengupta.develo...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as
> "SELECT
> > >>> * FROM  WHERE  > '2021-03-01'" the
> query
> > >>> is failing with error:
> > >>>
> > >>>
> ---
> > >>> pyspark.sql.utils.AnalysisException:
> > >>> org.apache.hadoop.hive.metastore.api.InvalidObjectException:
> Unsupported
> > >>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400;
> Error
> > >>> Code: InvalidInputException; Request ID:
> > >>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> > >>>
> > >>>
> ---
> > >>>
> > >>> The above query works fine in all previous versions of SPARK.
> > >>>
> > >>> Is this the expected behaviour in SPARK 3.1.1? If so can someone
> please
> > >>> let me know how to write this query.
> > >>>
> > >>> Also if this is the expected behaviour I think that a lot of users
> will
> > >>> have to make these changes in their existing code making transition
> to
> > >>> SPARK 3.1.1 expensive I think.
> > >>>
> > >>> Regards,
> > >>> Gourav Sengupta
> > >>>
> > >>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Nicolas Paris
As a workaround, turn off pruning:

spark.sql.hive.metastorePartitionPruning false
spark.sql.hive.convertMetastoreParquet false

see 
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
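
For example, a minimal PySpark sketch of applying those two settings when the
session is created (the config keys are the ones above; the app name and the
query are only illustrative, and the same keys could instead go into
spark-defaults.conf or an EMR configuration classification):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("glue-pruning-workaround")  # placeholder app name
         .config("spark.sql.hive.metastorePartitionPruning", "false")
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .enableHiveSupport()
         .getOrCreate())

# With pruning turned off, the date predicate is evaluated by Spark itself
# instead of being pushed to the Glue catalog as a partition expression.
spark.sql("SELECT * FROM table_name WHERE data_partition > DATE '2021-03-01'").show()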

On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> Hi,
>
> I received a response from AWS, this is an issue with EMR, and they are
> working on resolving the issue I believe.
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
> > Hi,
> >
> > the query still gives the same error if we write "SELECT * FROM table_name
> > WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> >
> > Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
> >
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
> >
> >> Date handling was tightened up in Spark 3. I think you need to compare to
> >> a date literal, not a string literal.
> >>
> >> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> >> gourav.sengupta.develo...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT
> >>> * FROM  WHERE  > '2021-03-01'" the query
> >>> is failing with error:
> >>>
> >>> ---
> >>> pyspark.sql.utils.AnalysisException:
> >>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
> >>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
> >>> Code: InvalidInputException; Request ID:
> >>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> >>>
> >>> ---
> >>>
> >>> The above query works fine in all previous versions of SPARK.
> >>>
> >>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
> >>> let me know how to write this query.
> >>>
> >>> Also if this is the expected behaviour I think that a lot of users will
> >>> have to make these changes in their existing code making transition to
> >>> SPARK 3.1.1 expensive I think.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: AWS EMR SPARK 3.1.1 date issues

2021-08-24 Thread Gourav Sengupta
Hi,

I received a response from AWS: this is an issue with EMR, and I believe they
are working on resolving it.

Thanks and Regards,
Gourav Sengupta

On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:

> Hi,
>
> the query still gives the same error if we write "SELECT * FROM table_name
> WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
>
> Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
>
>> Date handling was tightened up in Spark 3. I think you need to compare to
>> a date literal, not a string literal.
>>
>> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
>> gourav.sengupta.develo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT
>>> * FROM  WHERE  > '2021-03-01'" the query
>>> is failing with error:
>>>
>>> ---
>>> pyspark.sql.utils.AnalysisException:
>>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
>>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
>>> Code: InvalidInputException; Request ID:
>>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
>>>
>>> ---
>>>
>>> The above query works fine in all previous versions of SPARK.
>>>
>>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
>>> let me know how to write this query.
>>>
>>> Also if this is the expected behaviour I think that a lot of users will
>>> have to make these changes in their existing code making transition to
>>> SPARK 3.1.1 expensive I think.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi,

The query still gives the same error if we write "SELECT * FROM table_name
WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".

Also, the queries work fine in Spark 3.0.x and in EMR 6.2.0.


Thanks and Regards,
Gourav Sengupta

On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:

> Date handling was tightened up in Spark 3. I think you need to compare to
> a date literal, not a string literal.
>
> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
>> Hi,
>>
>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
>> FROM  WHERE  > '2021-03-01'" the query is
>> failing with error:
>>
>> ---
>> pyspark.sql.utils.AnalysisException:
>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
>> Code: InvalidInputException; Request ID:
>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
>>
>> ---
>>
>> The above query works fine in all previous versions of SPARK.
>>
>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
>> let me know how to write this query.
>>
>> Also if this is the expected behaviour I think that a lot of users will
>> have to make these changes in their existing code making transition to
>> SPARK 3.1.1 expensive I think.
>>
>> Regards,
>> Gourav Sengupta
>>
>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Sean Owen
Date handling was tightened up in Spark 3. I think you need to compare to a
date literal, not a string literal.
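
A sketch of what that could look like, reusing the table and partition-column
names from elsewhere in the thread and assuming an existing SparkSession named
spark (whether this avoids the Glue-side error on EMR 6.3.0 is a separate
question; the rest of the thread suggests the EMR issue persisted even then):

# String comparison reported to fail on EMR 6.3.0:
#   SELECT * FROM table_name WHERE data_partition > '2021-03-01'

# Typed date literal instead of a string literal:
df1 = spark.sql(
    "SELECT * FROM table_name WHERE data_partition > DATE '2021-03-01'")

# Rolling 10-day window without mixing strings and dates:
df2 = spark.sql(
    "SELECT * FROM table_name "
    "WHERE data_partition > DATE_SUB(CURRENT_DATE(), 10)")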

On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:

> Hi,
>
> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
> FROM  WHERE  > '2021-03-01'" the query is
> failing with error:
> ---
> pyspark.sql.utils.AnalysisException:
> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
> Code: InvalidInputException; Request ID:
> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> ---
>
> The above query works fine in all previous versions of SPARK.
>
> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
> let me know how to write this query.
>
> Also if this is the expected behaviour I think that a lot of users will
> have to make these changes in their existing code making transition to
> SPARK 3.1.1 expensive I think.
>
> Regards,
> Gourav Sengupta
>


Re: Aws

2019-02-08 Thread Pedro Tuero
Hi Noritaka,

I start clusters from the Java API.
Clusters running on 5.16 have no manual configurations in the EMR console
Configuration tab, so I assume the value of this property is the default on
5.16.
I enabled maximize resource allocation because otherwise the number of
cores automatically assigned (without setting spark.executor.cores
manually) was always one per executor.

I already use the same configurations: the same scripts and configuration
files, the same job, and the same input data, only changing the binaries to
my own code, which launches the clusters using the EMR 5.20 release label.

Anyway, setting maximize resource allocation seems to have helped enough
with the core distribution.
Some jobs even take less time than before.
Now I'm stuck analyzing a case where the number of tasks created seems to
be the problem; I have recently posted another thread in this forum about
that.

Regards,
Pedro


On Thu, Feb 7, 2019 at 9:37 PM, Noritaka Sekiyama (moomind...@gmail.com) wrote:

> Hi Pedro,
>
> It seems that you disabled maximize resource allocation in 5.16, but
> enabled in 5.20.
> This config can be different based on how you start EMR cluster (via quick
> wizard, advanced wizard in console, or CLI/API).
> You can see that in EMR console Configuration tab.
>
> Please compare spark properties (especially spark.executor.cores,
> spark.executor.memory, spark.dynamicAllocation.enabled, etc.)  between
> your two Spark cluster with different version of EMR.
> You can see them from Spark web UI’s environment tab or log files.
> Then please try with the same properties against the same dataset with the
> same deployment mode (cluster or client).
>
> Even in EMR, you can configure num of cores and memory of driver/executors
> in config files, arguments in spark-submit, and inside Spark app if you
> need.
>
>
> Warm regards,
> Nori
>
> On Fri, Feb 8, 2019 at 8:16, Hiroyuki Nagata wrote:
>
>> Hi,
>> thank you Pedro
>>
>> I tested maximizeResourceAllocation option. When it's enabled, it seems
>> Spark utilized their cores fully. However the performance is not so
>> different from default setting.
>>
>> I consider to use s3-distcp for uploading files. And, I think
>> table(dataframe) caching is also effectiveness.
>>
>> Regards,
>> Hiroyuki
>>
>> On Sat, Feb 2, 2019 at 1:12, Pedro Tuero wrote:
>>
>>> Hi Hiroyuki, thanks for the answer.
>>>
>>> I found a solution for the cores per executor configuration:
>>> I set this configuration to true:
>>>
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
>>> Probably it was true by default at version 5.16, but I didn't find when
>>> it has changed.
>>> In the same link, it says that dynamic allocation is true by default. I
>>> thought it would do the trick but reading again I think it is related to
>>> the number of executors rather than the number of cores.
>>>
>>> But the jobs are still taking more than before.
>>> Watching application history,  I see these differences:
>>> For the same job, the same kind of instances types, default (aws
>>> managed) configuration for executors, cores, and memory:
>>> Instances:
>>> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances
>>> * 4 cores).
>>>
>>> With 5.16:
>>> - 24 executors  (4 in each instance, including the one who also had the
>>> driver).
>>> - 4 cores each.
>>> - 2.7  * 2 (Storage + on-heap storage) memory each.
>>> - 1 executor per core, but at the same time  4 cores per executor (?).
>>> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
>>> - Total Elapsed Time: 6 minutes
>>> With 5.20:
>>> - 5 executors (1 in each instance, 0 in the instance with the driver).
>>> - 4 cores each.
>>> - 11.9  * 2 (Storage + on-heap storage) memory each.
>>> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
>>> - Total Elapsed Time: 8 minutes
>>>
>>>
>>> I don't understand the configuration of 5.16, but it works better.
>>> It seems that in 5.20, a full instance is wasted with the driver only,
>>> while it could also contain an executor.
>>>
>>>
>>> Regards,
>>> Pedro.
>>>
>>>
>>>
>>> On Thu, Jan 31, 2019 at 8:16 PM, Hiroyuki Nagata wrote:
>>>
 Hi, Pedro


 I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
 performance tuning.

 Do you configure dynamic allocation ?

 FYI:

 https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

 I've not tested it yet. I guess spark-submit needs to specify number of
 executors.

 Regards,
 Hiroyuki

 On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:

> Hi guys,
> I use to run spark jobs in Aws emr.
> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
> 2.4.0).
> I've noticed that a lot of steps are taking longer than before.
> I think it is related to the automatic 

Re: Aws

2019-02-07 Thread Noritaka Sekiyama
Hi Pedro,

It seems that you disabled maximize resource allocation in 5.16 but
enabled it in 5.20.
This config can differ based on how you start the EMR cluster (via the quick
wizard, the advanced wizard in the console, or the CLI/API).
You can see it in the EMR console Configuration tab.

Please compare the Spark properties (especially spark.executor.cores,
spark.executor.memory, spark.dynamicAllocation.enabled, etc.) between your
two Spark clusters with different versions of EMR.
You can see them in the Spark web UI's Environment tab or in the log files.
Then please try with the same properties against the same dataset and the
same deployment mode (cluster or client).

Even in EMR, you can configure the number of cores and the memory of the
driver/executors in config files, in spark-submit arguments, and inside the
Spark app if you need to.
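
For example, a sketch of pinning those properties explicitly at submit time so
the two releases can be compared like-for-like (the class name, jar path, and
numbers are placeholders, not recommendations):

spark-submit \
  --deploy-mode cluster \
  --class com.example.MyJob \
  --executor-cores 4 \
  --executor-memory 10g \
  --conf spark.dynamicAllocation.enabled=true \
  /home/hadoop/my-job.jar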


Warm regards,
Nori

On Fri, Feb 8, 2019 at 8:16, Hiroyuki Nagata wrote:

> Hi,
> thank you Pedro
>
> I tested maximizeResourceAllocation option. When it's enabled, it seems
> Spark utilized their cores fully. However the performance is not so
> different from default setting.
>
> I consider to use s3-distcp for uploading files. And, I think
> table(dataframe) caching is also effectiveness.
>
> Regards,
> Hiroyuki
>
> On Sat, Feb 2, 2019 at 1:12, Pedro Tuero wrote:
>
>> Hi Hiroyuki, thanks for the answer.
>>
>> I found a solution for the cores per executor configuration:
>> I set this configuration to true:
>>
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
>> Probably it was true by default at version 5.16, but I didn't find when
>> it has changed.
>> In the same link, it says that dynamic allocation is true by default. I
>> thought it would do the trick but reading again I think it is related to
>> the number of executors rather than the number of cores.
>>
>> But the jobs are still taking more than before.
>> Watching application history,  I see these differences:
>> For the same job, the same kind of instances types, default (aws managed)
>> configuration for executors, cores, and memory:
>> Instances:
>> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances *
>> 4 cores).
>>
>> With 5.16:
>> - 24 executors  (4 in each instance, including the one who also had the
>> driver).
>> - 4 cores each.
>> - 2.7  * 2 (Storage + on-heap storage) memory each.
>> - 1 executor per core, but at the same time  4 cores per executor (?).
>> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
>> - Total Elapsed Time: 6 minutes
>> With 5.20:
>> - 5 executors (1 in each instance, 0 in the instance with the driver).
>> - 4 cores each.
>> - 11.9  * 2 (Storage + on-heap storage) memory each.
>> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
>> - Total Elapsed Time: 8 minutes
>>
>>
>> I don't understand the configuration of 5.16, but it works better.
>> It seems that in 5.20, a full instance is wasted with the driver only,
>> while it could also contain an executor.
>>
>>
>> Regards,
>> Pedro.
>>
>>
>>
>> On Thu, Jan 31, 2019 at 8:16 PM, Hiroyuki Nagata wrote:
>>
>>> Hi, Pedro
>>>
>>>
>>> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
>>> performance tuning.
>>>
>>> Do you configure dynamic allocation ?
>>>
>>> FYI:
>>>
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>>
>>> I've not tested it yet. I guess spark-submit needs to specify number of
>>> executors.
>>>
>>> Regards,
>>> Hiroyuki
>>>
>>> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:
>>>
 Hi guys,
 I use to run spark jobs in Aws emr.
 Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
 2.4.0).
 I've noticed that a lot of steps are taking longer than before.
 I think it is related to the automatic configuration of cores by
 executor.
 In version 5.16, some executors toke more cores if the instance allows
 it.
 Let say, if an instance had 8 cores and 40gb of ram, and ram configured
 by executor was 10gb, then aws emr automatically assigned 2 cores by
 executor.
 Now in label 5.20, unless I configure the number of cores manually,
 only one core is assigned per executor.

 I don't know if it is related to Spark 2.4.0 or if it is something
 managed by aws...
 Does anyone know if there is a way to automatically use more cores when
 it is physically possible?

 Thanks,
 Peter.

>>>


Re: Aws

2019-02-07 Thread Hiroyuki Nagata
Hi,
thank you Pedro

I tested the maximizeResourceAllocation option. When it's enabled, it seems
Spark utilizes its cores fully. However, the performance is not so
different from the default setting.

I am considering using s3-distcp for uploading files. And I think
table (DataFrame) caching would also be effective.

Regards,
Hiroyuki

On Sat, Feb 2, 2019 at 1:12, Pedro Tuero wrote:

> Hi Hiroyuki, thanks for the answer.
>
> I found a solution for the cores per executor configuration:
> I set this configuration to true:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
> Probably it was true by default at version 5.16, but I didn't find when it
> has changed.
> In the same link, it says that dynamic allocation is true by default. I
> thought it would do the trick but reading again I think it is related to
> the number of executors rather than the number of cores.
>
> But the jobs are still taking more than before.
> Watching application history,  I see these differences:
> For the same job, the same kind of instances types, default (aws managed)
> configuration for executors, cores, and memory:
> Instances:
> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances *
> 4 cores).
>
> With 5.16:
> - 24 executors  (4 in each instance, including the one who also had the
> driver).
> - 4 cores each.
> - 2.7  * 2 (Storage + on-heap storage) memory each.
> - 1 executor per core, but at the same time  4 cores per executor (?).
> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
> - Total Elapsed Time: 6 minutes
> With 5.20:
> - 5 executors (1 in each instance, 0 in the instance with the driver).
> - 4 cores each.
> - 11.9  * 2 (Storage + on-heap storage) memory each.
> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
> - Total Elapsed Time: 8 minutes
>
>
> I don't understand the configuration of 5.16, but it works better.
> It seems that in 5.20, a full instance is wasted with the driver only,
> while it could also contain an executor.
>
>
> Regards,
> Pedro.
>
>
>
> On Thu, Jan 31, 2019 at 8:16 PM, Hiroyuki Nagata wrote:
>
>> Hi, Pedro
>>
>>
>> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
>> performance tuning.
>>
>> Do you configure dynamic allocation ?
>>
>> FYI:
>>
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> I've not tested it yet. I guess spark-submit needs to specify number of
>> executors.
>>
>> Regards,
>> Hiroyuki
>>
>> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:
>>
>>> Hi guys,
>>> I use to run spark jobs in Aws emr.
>>> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
>>> 2.4.0).
>>> I've noticed that a lot of steps are taking longer than before.
>>> I think it is related to the automatic configuration of cores by
>>> executor.
>>> In version 5.16, some executors toke more cores if the instance allows
>>> it.
>>> Let say, if an instance had 8 cores and 40gb of ram, and ram configured
>>> by executor was 10gb, then aws emr automatically assigned 2 cores by
>>> executor.
>>> Now in label 5.20, unless I configure the number of cores manually, only
>>> one core is assigned per executor.
>>>
>>> I don't know if it is related to Spark 2.4.0 or if it is something
>>> managed by aws...
>>> Does anyone know if there is a way to automatically use more cores when
>>> it is physically possible?
>>>
>>> Thanks,
>>> Peter.
>>>
>>


Re: Aws

2019-02-01 Thread Pedro Tuero
Hi Hiroyuki, thanks for the answer.

I found a solution for the cores-per-executor configuration:
I set this configuration to true:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
Probably it was true by default in version 5.16, but I didn't find when it
changed.
The same link says that dynamic allocation is true by default. I thought it
would do the trick, but reading it again, I think it relates to the number
of executors rather than the number of cores.
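
(For reference, a sketch of how that setting is passed as an EMR configuration
classification per the linked page; it would go into the --configurations JSON
of the CLI, or the equivalent Configuration objects when launching from the
Java API:)

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]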

But the jobs are still taking longer than before.
Watching the application history, I see these differences
for the same job, the same instance types, and the default (AWS-managed)
configuration for executors, cores, and memory:
Instances:
6 r5.xlarge: 4 vCPUs, 32 GB of memory each (so there are 24 cores: 6
instances * 4 cores).

With 5.16:
- 24 executors  (4 in each instance, including the one who also had the
driver).
- 4 cores each.
- 2.7  * 2 (Storage + on-heap storage) memory each.
- 1 executor per core, but at the same time  4 cores per executor (?).
- Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
- Total Elapsed Time: 6 minutes
With 5.20:
- 5 executors (1 in each instance, 0 in the instance with the driver).
- 4 cores each.
- 11.9  * 2 (Storage + on-heap storage) memory each.
- Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
- Total Elapsed Time: 8 minutes


I don't understand the 5.16 configuration, but it works better.
It seems that in 5.20 a full instance is wasted on the driver alone,
while it could also host an executor.


Regards,
Pedro.



On Thu, Jan 31, 2019 at 8:16 PM, Hiroyuki Nagata wrote:

> Hi, Pedro
>
>
> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
> performance tuning.
>
> Do you configure dynamic allocation ?
>
> FYI:
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> I've not tested it yet. I guess spark-submit needs to specify number of
> executors.
>
> Regards,
> Hiroyuki
>
> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:
>
>> Hi guys,
>> I use to run spark jobs in Aws emr.
>> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
>> 2.4.0).
>> I've noticed that a lot of steps are taking longer than before.
>> I think it is related to the automatic configuration of cores by executor.
>> In version 5.16, some executors toke more cores if the instance allows it.
>> Let say, if an instance had 8 cores and 40gb of ram, and ram configured
>> by executor was 10gb, then aws emr automatically assigned 2 cores by
>> executor.
>> Now in label 5.20, unless I configure the number of cores manually, only
>> one core is assigned per executor.
>>
>> I don't know if it is related to Spark 2.4.0 or if it is something
>> managed by aws...
>> Does anyone know if there is a way to automatically use more cores when
>> it is physically possible?
>>
>> Thanks,
>> Peter.
>>
>


Re: Aws

2019-01-31 Thread Hiroyuki Nagata
Hi, Pedro


I have also started using AWS EMR, with Spark 2.4.0. I'm looking for
performance tuning methods.

Do you configure dynamic allocation?

FYI:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

I've not tested it yet. I guess spark-submit needs to specify the number of
executors.
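
For anyone trying it, a sketch of the relevant properties in
spark-defaults.conf form (the values are illustrative; per the linked doc,
dynamic allocation on YARN also needs the external shuffle service):

# Enable dynamic allocation instead of fixing --num-executors
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   24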

Regards,
Hiroyuki

On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:

> Hi guys,
> I use to run spark jobs in Aws emr.
> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark 2.4.0).
> I've noticed that a lot of steps are taking longer than before.
> I think it is related to the automatic configuration of cores by executor.
> In version 5.16, some executors toke more cores if the instance allows it.
> Let say, if an instance had 8 cores and 40gb of ram, and ram configured by
> executor was 10gb, then aws emr automatically assigned 2 cores by executor.
> Now in label 5.20, unless I configure the number of cores manually, only
> one core is assigned per executor.
>
> I don't know if it is related to Spark 2.4.0 or if it is something managed
> by aws...
> Does anyone know if there is a way to automatically use more cores when it
> is physically possible?
>
> Thanks,
> Peter.
>


Re: AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Srinath C
You could use IAM roles in AWS to access the data in S3 without credentials.
See this link and this link for an example.
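
A minimal PySpark sketch of what that looks like once a role is attached to
the instance (or container): no keys appear in the path. The bucket path and
the explicit credentials-provider setting are assumptions for illustration,
not something from the original thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-with-iam-role").getOrCreate()

# Optionally point the s3a connector at the instance-profile credentials
# instead of having it look for access keys.
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider")

# Plain s3a path, no access key or secret key embedded in it.
model_df = spark.read.parquet("s3a://my-bucket/models/my-model")  # hypothetical path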

Regards,
Srinath.


On Thu, May 10, 2018 at 7:04 AM, Mina Aslani  wrote:

> Hi,
>
> I am trying to load a ML model from AWS S3 in my spark app running in a
> docker container, however I need to pass the AWS credentials.
> My questions is, why do I need to pass the credentials in the path?
> And what is the workaround?
>
> Best regards,
> Mina
>


Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer, but you can create a big fat jar combining all the
classes from the three jars and pass that single jar instead.
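
Another angle, if merging the jars is impractical: pass the step as JSON
through the CLI's file:// parameter input, so the comma inside the --jars
value is never re-split by the shorthand parser. A sketch only, reusing the
class and paths from the original mail; the exact step schema is worth
double-checking against the EMR CLI docs for your version:

[
  {
    "Type": "CUSTOM_JAR",
    "Name": "spark-step",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": ["spark-submit", "--class", "com.blabla.job",
             "--jars", "/home/hadoop/first.jar,/home/hadoop/second.jar",
             "/home/hadoop/main.jar", "--verbose"]
  }
]

and then something like: aws emr create-cluster ... --steps file://./steps.json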

Thanks
Best Regards

On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan 
wrote:

> Hello
>
> I have a question about AWS CLI for people who use it.
>
> I create a spark cluster with aws cli and i’m using spark step with jar
> dependencies. But as you can see below i can not set multiple jars because
> AWS CLI replaces comma with space in ARGS.
>
> Is there a way of doing it? I can accept every kind of solutions. For
> example, i tried to merge these two jar dependencies but i could not manage
> it.
>
>
> aws emr create-cluster
> …..
> …..
> Args=[--class,com.blabla.job,
> —jars,"/home/hadoop/first.jar,/home/hadoop/second.jar",
> /home/hadoop/main.jar,--verbose]
>
>
> I also tried to escape comma with \\, but it did not work.
>


Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO question:
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
It has a discussion of various ways of accessing S3.

Thanks
Best Regards

On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote:

 Hi Guys,

 I think this problem is related to :

 http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html

 I am running pyspark 1.2.1 in AWS with my AWS credentials exported to
 master
 node as Environmental Variables.

 Halfway through my application, I get thrown with a
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
 S3 HEAD request failed for file path - ResponseCode=403,
 ResponseMessage=Forbidden

 Here is some important information about my job:
 + my AWS credentials exported to master node as Environmental Variables
 + there are no '/'s in my secret key
 + The earlier steps that uses this parquet file actually complete
 successsfully
 + The step before the count() does the following:
+ reads the parquet file (SELECT STATEMENT)
+ maps it to an RDD
+ runs a filter on the RDD
 + The filter works as follows:
+ extracts one field from each RDD line
+ checks with a list of 40,000 hashes for presence (if field in
 LIST_OF_HASHES.value)
+ LIST_OF_HASHES is a broadcast object

 The wierdness is that I am using this parquet file in earlier steps and it
 works fine. The other confusion I have is due to the fact that it only
 starts failing halfway through the stage. It completes a fraction of tasks
 and then starts failing..

 Hoping to hear something positive. Many thanks in advance

 Sahanbull

 The stack trace is as follows:
  negativeObs.count()
 [Stage 9:==   (161 + 240) /
 800]

 15/05/07 07:55:59 ERROR TaskSetManager: Task 277 in stage 9.0 failed 4
 times; aborting job
 Traceback (most recent call last):
   File stdin, line 1, in module
   File /root/spark/python/pyspark/rdd.py, line 829, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File /root/spark/python/pyspark/rdd.py, line 820, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File /root/spark/python/pyspark/rdd.py, line 725, in reduce
 vals = self.mapPartitions(func).collect()
   File /root/spark/python/pyspark/rdd.py, line 686, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o139.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task
 277 in stage 9.0 failed 4 times, most recent failure: Lost task 277.3 in
 stage 9.0 (TID 4832, ip-172-31-1-185.ec2.internal):
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
 S3 HEAD request failed for

 '/subbucket%2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2Ffilename.parquet%2Fpart-r-349.parquet'
 - ResponseCode=403, ResponseMessage=Forbidden
 at

 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
 at sun.reflect.GeneratedMethodAccessor116.invoke(Unknown Source)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at org.apache.hadoop.fs.s3native.$Proxy9.retrieveMetadata(Unknown
 Source)
 at

 org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
 at
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
 at

 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
 at
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at 

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys... I realised that it was a bug in my code that caused it to
break: I was running the filter on a SchemaRDD when I was supposed to be
running it on an RDD.

But I still don't understand why the stderr was about an S3 request rather
than a type-checking error such as "no tuple position 0 found in Row type".
The error was kind of misleading, which is why I overlooked this logical
error in my code.

Just thought I should keep this posted.

-in4




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: AWS SDK HttpClient version conflict (spark.files.userClassPathFirst not working)

2015-03-12 Thread 浦野 裕也

Hi Adam,

Could you try building Spark with the -Pkinesis-asl profile?
mvn -Pkinesis-asl -DskipTests clean package

Refer to the 'Running the Example' section:
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html

In fact, I've seen the same issue and have been able to use the AWS SDK by
trying the above.
I've also tried the 'spark.files.userClassPathFirst' flag, and it
doesn't work.


Regards,
Yuya Urano


On 2015/03/13 3:50, Adam Lewandowski wrote:
I'm trying to use the AWS SDK (v1.9.23) to connect to DynamoDB from 
within a Spark application. Spark 1.2.1 is assembled with HttpClient 
4.2.6, but the AWS SDK is depending on HttpClient 4.3.4 for it's 
communication with DynamoDB. The end result is an error when the app 
tries to connect to DynamoDB and gets Spark's version instead:

java.lang.NoClassDefFoundError: org/apache/http/client/methods/HttpPatch
at com.amazonaws.http.AmazonHttpClient.clinit(AmazonHttpClient.java:129)
at 
com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:120)
at 
com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.init(AmazonDynamoDBClient.java:359)
Caused by: java.lang.ClassNotFoundException: 
org.apache.http.client.methods.HttpPatch


Including HttpClient 4.3.4 as user jars doesn't improve the situation 
much:
java.lang.NoSuchMethodError: 
org.apache.http.params.HttpConnectionParams.setSoKeepalive(Lorg/apache/http/params/HttpParams;Z)V
at 
com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:95)


I've seen the documenation regarding the 
'spark.files.userClassPathFirst' flag and have tried to use it 
thinking it would resolve this issue. However, when that flag is used 
I get an NoClassDefFoundError on 'scala.Serializable':

java.lang.NoClassDefFoundError: scala/Serializable
...
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:46)

...
Caused by: java.lang.ClassNotFoundException: scala.Serializable

This seems odd to me, since scala.Serializable is included in the 
spark assembly. I thought perhaps my app was compiled against a 
different scala version than spark uses, but eliminated that 
possibility by using the scala compiler directly out of the spark 
assembly jar with identical results.


Has anyone else seen this issue, had any success with the 
spark.files.userClassPathFirst flag, or been able to use the AWS SDK?
I was going to submit this a Spark JIRA issue, but thought I would 
check here first.


Thanks,
Adam Lewandowski




Re: AWS Credentials for private S3 reads

2014-07-02 Thread Matei Zaharia
When you use hadoopConfiguration directly, I don’t think you have to replace 
the “/“ with “%2f”. Have you tried it without that? Also make sure you’re not 
replacing slashes in the URL itself.
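
For anyone following along, a rough PySpark sketch of the same idea (the
original code in the quoted mail is Scala): the keys go in as-is, with no %2F
escaping. The environment-variable names are placeholders; the path shape is
taken from the original mail:

import os
from pyspark import SparkContext

sc = SparkContext("local[8]", "SparkS3Test")

conf = sc._jsc.hadoopConfiguration()
# Raw values, no replace("/", "%2F"):
conf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
conf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

rdd = sc.textFile("s3n://odesk-bucket-name/subbucket/2014/01/datafile-01.gz")
print(rdd.count())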

Matei

On Jul 2, 2014, at 4:17 PM, Brian Gawalt bgaw...@gmail.com wrote:

 Hello everyone,
 
 I'm having some difficulty reading from my company's private S3 buckets. 
 I've got an S3 access key and secret key, and I can read the files fine from
 a non-Spark Scala routine via  AWScala http://github.com/seratch/AWScala 
 . But trying to read them with the SparkContext.textFiles([comma separated
 s3n://bucket/key uris]) leads to the following stack trace (where I've
 changed the object key to use terms 'subbucket' and 'datafile-' for privacy
 reasons:
 
 [error] (run-main-0) org.apache.hadoop.fs.s3.S3Exception:
 org.jets3t.service.S3ServiceException: S3 HEAD request failed for
 '/subbucket%2F2014%2F01%2Fdatafile-01.gz' - ResponseCode=403,
 ResponseMessage=Forbidden
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
 S3 HEAD request failed for '/ja_quick_info%2F2014%2F01%2Fapplied-01.gz' -
 ResponseCode=403, ResponseMessage=Forbidden
   at
 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryI
 [... etc ...]
 
 I'm handing off the credentials themselves via the following method:
 
 def cleanKey(s: String): String = s.replace("/", "%2F")
 
 val sc = new SparkContext("local[8]", "SparkS3Test")
 
 sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
 cleanKey(creds.accessKeyId))
 sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
 cleanKey(creds.secretAccessKey))
 
 The comma-separated URIs themselves look each look like:
 
 s3n://odesk-bucket-name/subbucket/2014/01/datafile-01.gz
 
 The actual string that I've replaced with 'subbucket' includes underscores
 but otherwise is just straight ASCII; the term 'datafile' is substituting is
 also just straight ASCII.
 
 This is using Spark 1.0.0, via a library dependency to sbt of:
 "org.apache.spark" % "spark-core_2.10" % "1.0.0"
 
 Any tips appeciated!
 Thanks much,
 -Brian
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-tp8687.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
Marco,

If you call spark-ec2 launch without specifying an AMI, it will default to
the Spark-provided AMI.
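
For reference, a sketch of the launch command quoted later in this thread with
the -a and -u flags dropped, so spark-ec2 falls back to its own AMI and the
default root user:

./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -s 10 \
  -v 0.9.0 -w 300 --no-ganglia -m m3.2xlarge -t m3.2xlarge \
  launch marcotest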

Nick


On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 setup properly and install packages like java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI and
 add more packages you like -- I have definitely done this recently and it
 works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally I've
 created several of my own AMIs with the following characteristics. None of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten

 2) adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privilidges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community to
 know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10
 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge
 launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that
 it must be 'root'. The typical workaround is as you said, allow the ssh
 with the root user. Now, don't laugh, but, this worked last Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify 
 this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access
 and a lot of internal scripts assume have the user's home directory 
 hard
 coded as /root.   However all the Spark AMIs we build should have root 
 ssh
 access -- Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled
 for ssh. 

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.


On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Marco,

 If you call spark-ec2 launch without specifying an AMI, it will default to
 the Spark-provided AMI.

 Nick


 On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 setup properly and install packages like java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI
 and add more packages you like -- I have definitely done this recently and
 it works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally I've
 created several of my own AMIs with the following characteristics. None of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten

 2) adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privilidges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community
 to know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please 
 and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s
 10 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge
 launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that
 it must be 'root'. The typical workaround is as you said, allow the ssh
 with the root user. Now, don't laugh, but, this worked last Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify 
 this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access
 and a lot of internal scripts assume have the user's home directory 
 hard
 coded as /root.   However all the Spark AMIs we build should have 
 root ssh
 access -- Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting 

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
AMIs.


On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 And for the record, that AMI is ami-35b1885c. Again, you don't need to
 specify it explicitly; spark-ec2 will default to it.


 On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Marco,

 If you call spark-ec2 launch without specifying an AMI, it will default
 to the Spark-provided AMI.

 Nick


 On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 setup properly and install packages like java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI
 and add more packages you like -- I have definitely done this recently and
 it works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally I've
 created several of my own AMIs with the following characteristics. None 
 of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new 
 AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten

 2) adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privilidges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community
 to know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please 
 and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s
 10 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge
 launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and
 that it must be 'root'. The typical workaround is as you said, allow 
 the
 ssh with the root user. Now, don't laugh, but, this worked last 
 Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the
 root user's 'authorized_keys' file is always overwritten. This means 
 the
 workaround doesn't work anymore! I would LOVE for someone to verify 
 this.

 Just to point out, I am trying to make this work with a
 paravirtual instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access
 and a lot of internal scripts assume have the user's home directory 
 hard
 coded as /root.   However all the Spark AMIs 

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
The AMI should automatically switch between PVM and HVM based on the
instance type you specify on the command line. For reference (note you
don't need to specify this on the command line), the PVM ami id
is ami-5bb18832 in us-east-1.

FWIW we maintain the list of AMI Ids (across regions and pvm, hvm) at
https://github.com/mesos/spark-ec2/tree/v2/ami-list

Thanks
Shivaram


On Wed, Apr 9, 2014 at 9:12 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
 AMIs.


 On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And for the record, that AMI is ami-35b1885c. Again, you don't need to
 specify it explicitly; spark-ec2 will default to it.


 On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Marco,

 If you call spark-ec2 launch without specifying an AMI, it will default
 to the Spark-provided AMI.

 Nick


 On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 setup properly and install packages like java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI
 and add more packages you like -- I have definitely done this recently and
 it works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention: the AMI and user used. Naturally, I've
 created several of my own AMIs with the following characteristics, none of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based on the new AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten.

 2) Adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privileges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community
 to know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls the spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 \
   -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge \
   launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and
 that it must be 'root'. The typical workaround is as you said, allow the
 ssh with the root user. Now, don't laugh, but, this worked last Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the
 root user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention: the AMI and user used. Naturally, I've
created several of my own AMIs with the following characteristics, none of
which worked.

1) Enabling ssh as root as per this guide (
http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
When doing this, I do not specify a user for the spark-ec2 script. What
happens is that it works! But only while it's alive. If I stop the
instance, create an AMI, and launch a new instance based on the new AMI,
the change I made in the '/root/.ssh/authorized_keys' file is overwritten.

2) Adding the 'ec2-user' to the 'root' group. This means that the ec2-user
does not have to use sudo to perform any operations needing root
privileges. When doing this, I specify the user 'ec2-user' for the
spark-ec2 script. An error occurs: rsync fails with exit code 23.
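
For reference, rsync exit code 23 means a partial transfer due to errors, which is
consistent with a permissions failure while the scripts still target /root. The
group change described above isn't shown in the thread; presumably it was something
along these lines (a sketch, not Marco's actual command):

# add ec2-user to the root group as a supplementary group
sudo usermod -a -G root ec2-user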

I believe HVMs still work. But it would be valuable to the community to
know that the root user work-around does/doesn't work any more for
paravirtual instances.

Thanks,
Marco.


On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell script
 which calls the spark-ec2 wrapper script. I execute it from the 'ec2' directory
 of spark, as usual. The AMI used is the raw one from the AWS Quick Start
 section. It is the first option (an Amazon Linux paravirtual image). Any
 ideas or confirmation would be GREATLY appreciated. Please and thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 \
   -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge \
   launch marcotest
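
For readers who don't have the old spark-ec2 flags memorized, a rough gloss of the
options used above (my reading of the 0.9-era script; treat it as a best-effort
annotation rather than authoritative):

# -k  EC2 key pair name            -i  local private key file for ssh
# -u  ssh user on the instances    -s  number of slaves
# -v  Spark version to deploy      -w  seconds to wait for instances to come up
# -a  explicit AMI id              -m / -t  master / slave instance types
# --no-ganglia  skip Ganglia setup; "launch marcotest" = action + cluster name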



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to launch
 the instances ? The typical workflow is to use the spark-ec2 wrapper script
 using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard-coded as
 /root. However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram
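
A minimal sketch of the sshd_config change Shivaram suggests above, on a stock
Amazon Linux AMI (assumed paths and commands; the thread itself doesn't spell
them out):

# allow root logins in the ssh daemon config
sudo sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo service sshd restart
# reuse the key pair already authorized for ec2-user
sudo cp /home/ec2-user/.ssh/authorized_keys /root/.ssh/authorized_keys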



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for
 ssh. Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user
 to log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS to show that the script still assumes the
 specified user's home directory is '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.








Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the workaround ...around... by overwriting the
generated '/root/.ssh/authorized_keys' file with a known good one, in the
'/etc/rc.local' file.
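
A minimal sketch of what that rc.local restore might look like, assuming the
known-good key file has been staged at a hypothetical path such as
/root/.ssh/authorized_keys.good (the thread doesn't show the actual contents):

# appended to /etc/rc.local -- runs at the end of boot, after the provisioning
# step that (presumably) rewrites root's authorized_keys
cp -f /root/.ssh/authorized_keys.good /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
chown root:root /root/.ssh/authorized_keys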


On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Another thing I didn't mention: the AMI and user used. Naturally, I've
 created several of my own AMIs with the following characteristics, none of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based on the new AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten.

 2) Adding the 'ec2-user' to the 'root' group. This means that the ec2-user
 does not have to use sudo to perform any operations needing root
 privileges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community to
 know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls the spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 \
   -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge \
   launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard-coded as
 /root. However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for
 ssh. Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user
 to log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS to show that the script still assumes the
 specified user's home directory is '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.









Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram,

OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...

...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
'authorized_keys' file is always overwritten. This means the workaround
doesn't work anymore! I would LOVE for someone to verify this.

Just to point out, I am trying to make this work with a paravirtual
instance and not an HVM instance.

Please and thanks,
Marco.


On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a lot
 of internal scripts assume the user's home directory is hard-coded as
 /root. However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS to show that the script still assumes the
 specified user's home directory is '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.





Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch
the instances ? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html

Shivaram


On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
 'authorized_keys' file is always overwritten. This means the workaround
 doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard-coded as
 /root. However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS to show that the script still assumes the
 specified user's home directory is '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.