understanding right?
Manohar
From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Thursday, January 5, 2017 11:05 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: Spark Read from Google store and save in AWS s3
On 5 Jan 2017, at 09:58, Manohar753
<manohar.re...@happiestminds.com<mailto:manohar.re...@happiestminds.com>> wrote:
Hi All,
Using Spark, is interoperability between two clouds (Google, AWS) possible?
In my use case I need to take Google Cloud Storage as input to Spark, do some
processing, and finally store the results in S3; my Spark engine runs on an AWS
cluster.
Please let me know if there is any way to do this.
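Something like this might work (an untested sketch; the bucket names, key file
path and processing step are all placeholders, and it assumes the gcs-connector
and hadoop-aws jars are on the classpath):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("gcs-to-s3"))

  // Google Cloud Storage connector settings (key file path is an assumption)
  sc.hadoopConfiguration.set("fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  sc.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile",
    "/path/to/service-account.json")

  // S3 credentials for the output side
  sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

  val input = sc.textFile("gs://my-gcs-bucket/input/")
  val processed = input.filter(_.nonEmpty) // placeholder for the real processing
  processed.saveAsTextFile("s3a://my-s3-bucket/output/")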
I'd give the options as stated in the blog a shot, but changing mode to append.
Hello all,
Is it possible to write data from Spark Streaming to AWS Redshift?
I came across the following article, so it looks like it works from a Spark
batch program:
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
I want to write to AWS Redshift from Spark Streaming.
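Inside foreachRDD something like this might work (an untested sketch; the case
class, stream, JDBC URL, table name and tempdir are all placeholders for
whatever the job actually uses):

  import org.apache.spark.sql.SQLContext

  case class Event(id: String, value: Double)

  eventStream.foreachRDD { rdd =>          // eventStream: DStream[Event]
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    rdd.toDF().write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "events")
      .option("tempdir", "s3a://my-bucket/tmp/")
      .mode("append")                      // append each micro-batch
      .save()
  }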
s3.awsAcces sKeyId",AccessKey)
hadoopConf.set("fs.s3.awsSecre tAccessKey",SecretKey)
var jobInput = sc.textFile("s3://path to bucket")
Thanks
On Fri, Aug 26, 2016 at 5:16 PM, kant kodali < kanth...@gmail.com > wrote:
Hi guys,
Are there any instructions on how to set up Spark with S3 on AWS?
Thanks!
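A minimal sketch (version number and bucket are placeholders; it assumes a
Spark build that matches the hadoop-aws module version):

  // launch with the S3A filesystem classes on the classpath, e.g.
  //   spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
  val rdd = sc.textFile("s3a://my-bucket/data/")
  println(rdd.count())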
Also for the record, turning on Kryo did not help.
On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote:
Splitting up the Maps to separate objects did not help.
However, I was able to work around the problem by reimplementing it with
RDD joins.
On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote:
This might be caused by a few large Map objects that Spark is trying to
serialize. These are not broadcast variables or anything; they're just
regular objects.
Would it help if I further indexed these maps into a two-level Map, i.e.
Map[String, Map[String, Int]]? Or would this still count against the same
serialization cost?
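For reference, a sketch of the RDD-join workaround mentioned above (loadBigMap
and loadRecords are hypothetical stand-ins for however the data is built):

  import org.apache.spark.rdd.RDD

  // turn the driver-side Map into an RDD so it is partitioned across the
  // cluster instead of being serialized into every task closure
  val bigMap: Map[String, Int] = loadBigMap()
  val lookup: RDD[(String, Int)] = sc.parallelize(bigMap.toSeq)

  val records: RDD[(String, String)] = loadRecords() // keyed by the same String
  val joined = records.join(lookup)                  // no big closure to ship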
I got this OOM error in Spark local mode. The error seems to have been at
the start of a stage (all of the stages on the UI showed as complete; there
were more stages to do, but they had not shown up on the UI yet).
There appears to be ~100G of free memory at the time of the error.
Spark 2.0.0
200G
When I spin up an AWS Spark cluster per the Spark EC2 script:
According to AWS:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances
there is a way of reserving a fixed-duration Spot cluster through the AWS CLI
and the web portal, but I can't find
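As far as I can tell spark-ec2 has no flag for this, but the underlying EC2
API does expose fixed-duration ("spot block") requests; a hedged sketch of the
CLI call (price, count and launch spec are placeholders):

  aws ec2 request-spot-instances \
      --spot-price "0.50" \
      --instance-count 3 \
      --block-duration-minutes 360 \
      --launch-specification file://launch-spec.json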
Hi,
I am running some Spark loads. I notice that Spark only uses one of the
machines (instead of the 3 available) in the cluster.
Is there any parameter that can be set to force it to use the whole cluster?
I am using AWS EMR with YARN.
Thanks,
Natu
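If it helps, explicitly sizing the executors at submit time is one way to make
YARN spread the job across all nodes (all numbers below are placeholders to be
matched to the instance types):

  spark-submit --master yarn --deploy-mode cluster \
      --num-executors 6 \
      --executor-cores 4 \
      --executor-memory 4g \
      my-app.jar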
On 9 Jun 2016, at 06:17, Daniel Haviv wrote:
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.
Thank you.
Daniel
That's not good.
I'm afraid I don't know what version of s3a is in the Cloudera build you are
running.
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable
to load AWS credentials from any provider in the chain)
I tried setting the s3a keys using the configuration object but I might be
hitting SPARK-11364<https://issues.apache.org/jira/browse/SPARK-11364>.
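For what it's worth, a sketch of setting the s3a keys before anything touches
the metastore (accessKey/secretKey are placeholders):

  sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
  sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

  // or equivalently at submit time:
  //   spark-submit --conf spark.hadoop.fs.s3a.access.key=... \
  //                --conf spark.hadoop.fs.s3a.secret.key=...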
Hello everyone, I am trying to compute the similarity between 550k objects
using the DIMSUM algorithm available in Spark 1.6.
The cluster runs on AWS Elastic Map Reduce and consists of 6 r3.2xlarge
instances (one master and five cores), having 8 vCPU and 61 GiB of RAM each.
My input data
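For anyone trying the same thing, a sketch of DIMSUM in Spark 1.6 MLlib (the
input path and parsing are placeholders; note that columnSimilarities()
compares columns, so the 550k objects have to be laid out as the columns of
the matrix):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.textFile("s3a://my-bucket/features/")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  val mat = new RowMatrix(rows)

  // the threshold trades accuracy for cost; 0.1 is an arbitrary choice here
  val similarities = mat.columnSimilarities(0.1)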
Hi
was wondering if anyone can assist here..
I am trying to create a Spark cluster on AWS using the scripts located in the
spark-1.6.1/ec2 directory.
When the spark_ec2.py script tries to do an rsync to copy directories over
to the AWS master node, it fails miserably with this stack trace:
DEBUG:spark ecd
Hi,
I agree with Steve, just start using vanilla Spark on EMR.
You can try point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin
Note that dynamic allocation of
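For reference, dynamic allocation on YARN also needs the external shuffle
service; a minimal spark-defaults.conf sketch (the executor bounds are
placeholders):

  spark.dynamicAllocation.enabled        true
  spark.shuffle.service.enabled          true
  spark.dynamicAllocation.minExecutors   1
  spark.dynamicAllocation.maxExecutors   20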
Hi, here we made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
you can deploy
On 28 Apr 2016, at 22:59, Alexander Pivovarov wrote:
Spark works well with S3 (read and write). However it's recommended to set
spark.speculation true (it's expected that some tasks fail if you read a large
S3 folder, so speculation should be enabled).
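The corresponding spark-defaults.conf entries would look something like this
(the quantile and multiplier values are illustrative, not from this thread):

  spark.speculation            true
  spark.speculation.quantile   0.90
  spark.speculation.multiplier 1.5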
> involved are. But it required lots of tuning work, because we are
> clearly under the recommended requirements. 4 of the 5 machines are
> switched off during the night; only the bridge machine is alive 24/7.
>
> $12 per month in total.
>
> Renato Perini.
On 28/04/2016 23:39, Fatma Ozcan wrote:
What is your experience using Spark on AWS? Are you setting up your
own Spark cluster, and using HDFS? Or are you using Spark as a service
from AWS? In the latter case, what is your experience of using S3
directly, without having HDFS in between?
Fatima, the easiest way to create a Spark cluster on AWS is to create an EMR
cluster and select the Spark application (the latest EMR includes Spark 1.6.1).
Spark works well with S3 (read and write). However it's recommended to
set spark.speculation true (it's expected that some tasks fail if you read
What is your experience using Spark on AWS? Are you setting up your own
Spark cluster, and using HDFS? Or are you using Spark as a service from
AWS? In the latter case, what is your experience of using S3 directly,
without having HDFS in between?
Thanks,
Fatma
I use spot instances for a 100-slave cluster (r3.2xlarge on us-west-1).
Jobs I run usually take about 15 hours - the cluster is stable and fast. 1-2
computers might be terminated, but it's a very rare event and Spark can handle
it.
On Fri, Mar 25, 2016 at 6:28 PM, Sven Krasser wrote:
When a spot instance terminates, you lose all data (RDD partitions) stored
in the executors that ran on that instance. Spark can recreate the
partitions from input data, but if that requires going through multiple
preceding shuffles a good chunk of the job will need to be redone.
-Sven
I'm very new to Apache Spark. I'm just a user, not a developer.
I'm running a cluster with many spot instances. Am I correct in
understanding that Spark can handle an unlimited number of spot instance
failures and restarts? Sometimes all the spot instances will disappear
without warning, and then
Hi All,
I want to enable the netlib-java feature for the Spark ML module on AWS
EMR. But the cluster comes with Spark installed by default, and I would rather
not install and configure the whole cluster myself. Does anyone have an idea
how to just enable netlib-java on the standard EMR Spark cluster
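One possible route (a sketch, assuming the netlib-java "all" artifact is
usable and the native gfortran libraries are installed on every node):

  // launch with the netlib-java natives on the classpath, e.g.
  //   spark-shell --packages com.github.fommil.netlib:all:1.1.2
  // then check which BLAS implementation actually loaded:
  println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
  // e.g. com.github.fommil.netlib.NativeSystemBLAS if the natives were found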
Hi guys,
I'm having a problem where respawning a failed executor during a job that
reads/writes parquet on S3 causes subsequent tasks to fail because of
missing AWS keys.
Setup:
I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple
standalone cluster:
1 master
2 workers
My
Hi,
On AWS EMR 4.2 / Spark 1.5.2, I tried the example here
https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#hive-tables
to load data from a file into a Hive table.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
/ElasticMapReduce/latest/ReleaseGuide/emr-release-differences.html
~ Jonathan
On Fri, Feb 26, 2016 at 6:38 PM Weiwei Zhang <wzhan...@dons.usfca.edu>
wrote:
Hi there,
I am trying to configure memory for Spark using the AWS CLI. However, I got the
following message:
*A client error (ValidationException) occurred when calling the RunJobFlow
operation: Cannot specify args for application 'Spark' when release label
is used.*
In the aws 'create-cluster
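With a release label, Spark settings have to go through the --configurations
JSON rather than application args; a sketch of what that might look like
(release label, instance settings and memory value are placeholders):

  aws emr create-cluster --release-label emr-4.3.0 \
      --applications Name=Spark \
      --instance-type m3.xlarge --instance-count 3 \
      --configurations '[{"Classification":"spark-defaults",
          "Properties":{"spark.executor.memory":"4g"}}]'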
> On 9 Feb 2016, at 07:19, lmk wrote:
>
> Hi Dhimant,
> As I had indicated in my next mail, my problem was due to disk getting full
> with log messages (these were dumped into the slaves) and did not have
> anything to do with the content pushed into s3. So,
I had similar problems with multi part uploads. In my case the real error
was something else which was being masked by this issue
https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad
digest exception was a side effect and not the original issue. For me it
was some library version
> On 7 Feb 2016, at 07:57, Dhimant wrote:
>
>at
> com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:245)
>... 15 more
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The
>
your life easier if you do go this route. Once you've
fleshed out your ideas I'm sure folks on this mailing list can provide
helpful guidance based on their real world experience with Spark.
Hi David,
My company uses Lambda to do simple data moving and processing using Python
scripts. I can see that using Spark instead for the data processing would make
it into a real production-level platform. Does this pave the way into replacing
the need of a pre-instantiated cluster in AWS or bought
Apache Spark package offering seamless integration with the AWS
Lambda <https://aws.amazon.com/lambda/> compute service for Spark batch and
streaming applications on the JVM.
Within traditional Spark deployments RDD tasks are executed using fixed
compute resources on worker nodes within
> At its core, EMR just launches Spark applications, whereas Databricks is a
> higher-level platform that also includes multi-user support, an interactive
> UI, security, and job scheduling.
>
> Specifically, Databricks runs standard Spark applications inside a user’s
> AWS account, similar to EMR, but it adds a variety of features to create an
> end-to-end environment for
Hi,
It may be interesting to see this. Can you please create a HiveContext
(using the standard AWS Spark stack on EMR 4.0), create a table to read the
Avro file, and read the data into a DataFrame using HiveContext SQL?
Please let me know if I can be of any help with this.
Regards,
Gourav
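A sketch of that suggestion (untested; the table name and S3 location are
placeholders, and it assumes the Hive Avro SerDe is available on the EMR 4.0
stack):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  hiveContext.sql(
    """CREATE EXTERNAL TABLE IF NOT EXISTS my_avro_table
      |STORED AS AVRO
      |LOCATION 's3://my-bucket/avro-data/'""".stripMargin)
  val df = hiveContext.sql("SELECT * FROM my_avro_table")
  df.show()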
Hi,
are you creating RDD's out of the data?
Regards,
Gourav
> without terminating the entire cluster?
>
> Thank you,
>
> Daniel
As a user of AWS EMR (running Spark and MapReduce), I am interested in
potential benefits that I may gain from Databricks Cloud. I was wondering
if anyone has used both and done a comparison / contrast between the two
services.
In general, which resource manager(s) does Databricks Cloud use
> The number of partitions used when reading data is 7315.
> The maximum size of a file to read is 14G.
> The size of the folder is around 270G.
Sorry, I have not been able to solve the issue. I used speculation mode as a
workaround to this.
On EMR, you can add fs.* params in emrfs-site.xml.
On Tue, Jan 12, 2016 at 7:27 AM, Jonathan Kelly wrote:
Yes, IAM roles are actually required now for EMR. If you use Spark on EMR
(vs. just EC2), you get S3 configuration for free (it goes by the name
EMRFS), and it will use your IAM role for communicating with S3. Here is
the corresponding documentation:
Hi all,
Is there a method for reading from s3 without having to hard-code keys? The
only 2 ways I've found both require this:
1. Set conf in code e.g.:
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey",
"")
2. Set keys in URL, e.g.:
If you are on EMR, these can go into your hdfs site config. And will work
with Spark on YARN by default.
Regards
Sab
On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote:
> Hi all,
>
> Is there a method for reading from s3 without having to hard-code keys?
> The only 2 ways I've
In production, I'd recommend using IAM roles to avoid having keys altogether.
Take a look at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
Matei
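In practice that means no keys anywhere: with an IAM role attached to the
instances, the AWS credential chain discovers it automatically (the bucket
name below is a placeholder):

  val data = sc.textFile("s3a://my-bucket/input/")
  println(data.count())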
Hi Kostiantyn,
I want to confirm that it works first by using hdfs-site.xml. If yes, you
could define different spark-{user-x}.conf and source them during
spark-submit. Let us know if hdfs-site.xml works first. It should.
Best Regards,
Jerry
On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev
<kudryavtsev.konstan...@gmail.com> wrote:
15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load
credentials from InstanceProfileCredentialsProvider: The requested metadata
is not found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
com.amazonaws.AmazonClientException: Unable to load AWS credentials
from any provider in the chain
at com.amazonaws.auth.AWSCred
couple things:
1) switch to IAM roles if at all possible - explicitly passing AWS
credentials is a long and lonely road in the end
2) one really bad workaround/hack is to run a job that hits every worker
and writes the credentials to the proper location (~/.awscredentials or
whatever)
Hi Jerry,
I want to run different jobs on different S3 buckets - different AWS creds
- on the same instances. Could you shed some light if it's possible to
achieve with hdfs-site?
Thank you,
Konstantin Kudryavtsev
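One possible approach (a sketch, relying on Spark's spark.hadoop.* passthrough
to the Hadoop configuration; keys, jars and paths are placeholders, and the
property names depend on which filesystem scheme you use): give each job its
own credentials at submit time instead of a shared hdfs-site.xml:

  # job A
  spark-submit \
      --conf spark.hadoop.fs.s3a.access.key=KEY_A \
      --conf spark.hadoop.fs.s3a.secret.key=SECRET_A \
      job-a.jar s3a://bucket-a/input

  # job B, different credentials, same instances
  spark-submit \
      --conf spark.hadoop.fs.s3a.access.key=KEY_B \
      --conf spark.hadoop.fs.s3a.secret.key=SECRET_B \
      job-b.jar s3a://bucket-b/input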
Hi Rastan,
Unless you're using off-heap memory or starting multiple executors per
machine, I would recommend the r3.2xlarge option, since you don't actually
want gigantic heaps (100GB is more than enough). I've personally run Spark
on a very large scale with r3.8xlarge instances, but I've been
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge.
Rastan, you can run a quick test using an EMR Spark cluster on spot instances
and see what configuration works better. Without the tests it is all
speculation.
On Dec 18, 2015 1:53 PM, "Andrew Or" wrote:
> Hi Rastan,
I'm trying to determine whether I should be using 10 r3.8xlarge or 40
r3.2xlarge. I'm mostly concerned with shuffle performance of the
application.
If I go with r3.8xlarge I will need to configure 4 worker instances per
machine to keep the JVM size down. The worker instances will likely contend
Not a direct answer but you can create a big fat jar combining all the
classes in the three jars and pass it.
Thanks
Best Regards
On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan <yu...@useinsider.com>
wrote:
> Hello
>
> I have a question about AWS CLI for people who use it.
Hi,
We encounter a problem very similar to this one:
https://www.mail-archive.com/search?l=user@spark.apache.org=subject:%22Spark+task+hangs+infinitely+when+accessing+S3+from+AWS%22=newest=1
When reading a large amount of data from S3, one or several tasks hang. It
doesn't happen every time
Hello,
I have a question about AWS CLI for people who use it.
I create a Spark cluster with the AWS CLI and I'm using a Spark step with jar
dependencies. But as you can see below I cannot set multiple jars, because the
AWS CLI replaces commas with spaces in ARGS.
Is there a way of doing it? I can accept
Any hints?
input paths requires an
AWS S3 API call to list everything based on the common prefix; so if your
input is something like:
s3://my-bucket*.json
then the prefix "///" will be passed to the API and
should be fairly efficient.
However if you're doing something more adventurous like:
A small csv file in S3. I use s3a://key:seckey@bucketname/a.csv
It works for SparkContext:
JavaRDD<String> pixelsStr = ctx.textFile(s3pathOrg);
It works for Java Spark-csv as well.
Java code:
DataFrame careerOneDF = sqlContext.read().format("com.databricks.spark.csv")
    .load(s3pathOrg);