Hi All,
I am using the latest version of EMR to overwrite Parquet files to an S3 bucket
encrypted with a KMS key. I am seeing the attached error whenever I overwrite a
Parquet file. For example, the code below produces the attached error and
stacktrace:
Hi Jeroen,
In case you are using HIVE partitions, how many partitions do you have?
Also, is there any chance that you might post the code?
Regards,
Gourav Sengupta
On Tue, Jan 2, 2018 at 7:50 AM, Jeroen Miller
wrote:
> Hello Gourav,
>
> On 30 Dec 2017, at 20:20, Gourav
Hello Mans,
On 1 Jan 2018, at 17:12, M Singh wrote:
> I am not sure if I missed it - but can you let us know what your input
> source and output sink are?
Reading from S3 and writing to S3.
However, the never-ending task 0.0 happens in a stage well before outputting
Hello Gourav,
On 30 Dec 2017, at 20:20, Gourav Sengupta wrote:
> Please try to use the SPARK UI in the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY
Hi Jeroen:
I am not sure if I missed it - but can you let us know what your input
source and output sink are?
In some cases, I found that saving to S3 was a problem. In this case I started
saving the output to the EMR HDFS and later copied it to S3 using s3-dist-cp,
which solved our issue.
Mans
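For reference, that write-to-HDFS-then-copy workaround can be sketched as below (bucket name and paths are placeholders, not from the original thread):

```shell
# 1. In the Spark job, write to the cluster's HDFS instead of S3, e.g.:
#      df.write.parquet("hdfs:///tmp/job-output")
# 2. Then bulk-copy the output to S3 from the master node:
s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output
```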
Here is the list that I will probably try to fill:
1. Check GC on the offending executor when the task is running. Maybe
you need even more memory.
2. Go back to some previous successful run of the job and check the
spark ui for the offending stage and check max task time/max
Hi,
Please try to use the SPARK UI in the way that AWS EMR recommends, it
should be available from the resource manager. I never ever had any problem
working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
DEBUGGING.
Sadly, I cannot be of much help unless we go for a screen share
you may have to recreate your cluster with the below configuration at EMR
creation:
"Configurations": [
  {
    "Properties": {
      "maximizeResourceAllocation": "false"
    },
    "Classification": "spark"
  }
]
On
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> Dynamic allocation is great; but sometimes I’ve found explicitly setting the
> num executors, cores per executor, and memory per executor to be a better
> alternative.
No difference with spark.dynamicAllocation.enabled
Hello,
Just a quick update as I have not made much progress yet.
On 28 Dec 2017, at 21:09, Gourav Sengupta wrote:
> can you try to then use the EMR version 5.10 instead or EMR version 5.11
> instead?
Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>
, Jeroen Miller <bluedasya...@gmail.com>
wrote:
> On 28 Dec 2017, at 19:42, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
> > In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
>
> Nothing that I can think
On 28 Dec 2017, at 19:42, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff
behind my back)
On 28 Dec 2017, at 19:40, Maximiliano Felice
wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of
> a mix of speculative execution and OOM issues in the container.
Interesting! However I don't have any OOM exception in the logs.
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> You are using groupByKey(); have you thought of an alternative like
> aggregateByKey() or combineByKey() to reduce shuffling?
I am aware of this indeed. I do have a groupByKey() that is difficult to avoid,
but the
job in that?
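As an aside for readers of the archive: the shuffle saving of aggregateByKey()/combineByKey() over groupByKey() comes from per-partition pre-aggregation (a map-side combine). Below is a minimal plain-Python sketch of the idea with made-up toy data; this is not the Spark API itself:

```python
from collections import defaultdict

# Two mock "partitions" of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

# groupByKey-style: every record crosses the shuffle boundary.
shuffled_records = [rec for part in partitions for rec in part]

# aggregateByKey-style: combine within each partition first, so at most
# one record per key per partition is shuffled.
def combine_partition(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

pre_combined = [rec for part in partitions for rec in combine_partition(part)]

# Final merge on the "reduce" side.
totals = defaultdict(int)
for k, v in pre_combined:
    totals[k] += v

print(len(shuffled_records), len(pre_combined))  # 5 vs 4 records shuffled
print(dict(totals))                              # {'a': 8, 'b': 7}
```

With groupByKey() all five records cross the shuffle boundary; with the combine step, at most one record per key per partition does, which is what reduces shuffling.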
Regards,
Gourav Sengupta
On Thu, Dec 28, 2017 at 4:06 PM, Jeroen Miller <bluedasya...@gmail.com>
wrote:
> Dear Sparkers,
>
> Once again in times of desperation, I leave what remains of my mental
> sanity to this wise and knowledgeable community.
>
> I have a
Hi Jeroen,
I experienced a similar issue a few weeks ago. The situation was a result
of a mix of speculative execution and OOM issues in the container.
First of all, when an executor takes too much time in Spark, it is handled
by the YARN speculative execution, which will launch a new executor
Jeroen,
Anytime there is a shuffle in the network, Spark moves to a new stage. It seems
like you are having issues either pre- or post-shuffle. Have you looked at a
resource management tool like Ganglia to determine if this is a memory- or
thread-related issue? The Spark UI?
You are using
On 28 Dec 2017, at 17:41, Richard Qiao wrote:
> Are you able to specify which path of data filled up?
I can narrow it down to a bunch of files but it's not so straightforward.
> Any logs not rolled over?
I have to manually terminate the cluster but there is nothing
Dear Sparkers,
Once again in times of desperation, I leave what remains of my mental sanity to
this wise and knowledgeable community.
I have a Spark job (on EMR 5.8.0) which had been running daily for months, if
not the whole year, with absolutely no supervision. This changed all of a sudden
-
>> From: Andrew Holway <andrew.hol...@otternetworks.de>
>> Date: 1/15/17 11:37 AM (GMT-05:00)
>> To: Marco Mistroni <mmistr...@gmail.com>
>> Cc: Neil Jonkers <neilod...@gmail.com>, User <user@spark.apache.org>
>> Subject: Re: Running Spark o
> From: Andrew Holway <andrew.hol...@otternetworks.de>
> Date: 1/15/17 11:37 AM (GMT-05:00)
> To: Marco Mistroni <mmistr...@gmail.com>
> Cc: Neil Jonkers <neilod...@gmail.com>, User <user@spark.apache.org>
> Subject: Re: Running Spark on EMR
>
> Darn. I didn't respond t
m>, User
<user@spark.apache.org> Subject: Re: Running Spark on EMR
Darn. I didn't respond to the list. Sorry.
On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni <mmistr...@gmail.com> wrote:
thanks Neil. I followed the original suggestion from Andrew and everything is
working fine now
On Sun,
<neilod...@gmail.com> wrote:
>
>> Hello,
>>
>> Can you drop the url:
>>
>> spark://master:7077
>>
>> The url is used when running Spark in standalone mode.
>>
>> Regards
>>
>>
>> Original message ----
>
ng Spark in standalone mode.
>
> Regards
>
>
> Original message
> From: Marco Mistroni
> Date:15/01/2017 16:34 (GMT+02:00)
> To: User
> Subject: Running Spark on EMR
>
> hi all
> could anyone assist here?
> i am trying to run spark 2.0.0 on an EMR c
Hello,
Can you drop the url:
spark://master:7077
The url is used when running Spark in standalone mode.
Regards
Original message From: Marco Mistroni
<mmistr...@gmail.com> Date:15/01/2017 16:34 (GMT+02:00)
To: User <user@spark.apache.org> Subject: Running S
hi all
could anyone assist here?
i am trying to run spark 2.0.0 on an EMR cluster, but i am having issues
connecting to the master node
So, below is a snippet of what i am doing
sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()
sparkHost is passed as input
Hi All,
I would like to ask two things and I would really appreciate an answer ASAP:
1. How do I implement parallelism in an Apache Spark Java application?
2. How do I run the Spark application on Amazon EMR?
Hi,
I am setting up a Scala Spark streaming app in EMR. I wonder if anyone on
the list can help me with the following questions:
1. What's the approach that you guys have been using to submit, in an EMR
job step, environment variables that will be needed by the Spark application?
2. Can i have
and,
> since I'm on AWS, those services are pretty well integrated together... but I
> could only find examples about using it on standalone EC2 instances.
> >
> > In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
> mode), I would like to be able to real-
the CloudWatch Logs + Lambda pretty straightforward and, since
> I'm on AWS, those services are pretty well integrated together... but I could
> only find examples about using it on standalone EC2 instances.
>
> In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
&
...but I could
only find examples about using it on standalone EC2 instances.
In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
mode), I would like to be able to monitor Spark logs in real time, not just
after the processing ends and they are copied to S3. Is there any
out
be found from the YARN Resource Manager UI (master node:8088) and
it would be best to use a SOCKS proxy in order to nicely resolve the URLs.
Best regards,
Christopher
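For readers following along, the SOCKS proxy setup Christopher refers to is typically SSH dynamic port forwarding to the master node (the key path, local port, and master DNS below are placeholders; AWS documentation commonly uses local port 8157):

```shell
# Open a SOCKS tunnel to the EMR master node; then configure the browser
# to use a SOCKS proxy at localhost:8157 and browse to
# http://<master-node-dns>:8088 for the YARN Resource Manager UI.
ssh -i ~/mykey.pem -N -D 8157 hadoop@<master-node-dns>
```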
From: SURAJ SHETH [mailto:shet...@gmail.com]
Sent: Sunday, November 15, 2015 8:19 AM
To: user@spark.apache.org
Subject: Yarn Spark
Hi,
Yarn UI on 18080 stops receiving updates on Spark jobs/tasks immediately after
it starts. We see only one task completed in the UI while the others haven't
got any resources, when in reality more than 5 tasks would have completed.
Hadoop - Amazon 2.6
Spark - 1.5
Thanks and Regards,
Suraj Sheth
Hi,
I am trying to use Zeppelin to work with Spark on Amazon EMR. I used the
script provided by Anders (
https://gist.github.com/andershammar/224e1077021d0ea376dd) to setup
Zeppelin. Zeppelin can connect to Spark, but I get the following error when
I run the tutorials:
Please community, I'd really appreciate your opinion on this topic.
Best regards,
Roberto
-- Forwarded message --
From: Roberto Coluccio roberto.coluc...@gmail.com
Date: Sat, Jul 25, 2015 at 6:28 PM
Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external
table
Hello Spark community,
I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode
on an EMR cluster (AMI 3.7.0) that reads input data through a HiveContext,
in particular SELECTing data from an EXTERNAL TABLE backed by S3. Such
table has dynamic partitions and contains *hundreds
on Amazon. Do I still need to
provide the keys?
Thank you,
*From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
*Sent:* Tuesday, July 14, 2015 3:14 PM
*To:* Pagliari, Roberto
*Cc:* user@spark.apache.org
*Subject:* Re: Spark on EMR with S3 example (Python)
Hi Roberto,
I have written
to
provide the keys?
Thank you,
*From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
*Sent:* Tuesday, July 14, 2015 3:14 PM
*To:* Pagliari, Roberto
*Cc:* user@spark.apache.org
*Subject:* Re: Spark on EMR with S3 example (Python)
Hi Roberto,
I have written PySpark code that reads
Hi Roberto,
I have written PySpark code that reads from private S3 buckets, it should
be similar for public S3 buckets as well. You need to set the AWS access
and secret keys into the SparkContext, then you can access the S3 folders
and files with their s3n:// paths. Something like this:
sc =
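The snippet is cut off above; for the legacy s3n:// scheme of that era, the usual pattern was to put the keys into the SparkContext's Hadoop configuration, roughly as below (a hedged configuration sketch assuming PySpark; bucket, path, and key values are placeholders):

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("s3-read"))

# Configuration keys for the legacy s3n:// filesystem (placeholder values).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

# S3 folders and files are then addressable by their s3n:// paths.
rdd = sc.textFile("s3n://my-bucket/path/to/file.txt")
```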
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide
the keys?
Thank you,
From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python
Is there an example about how to load data from a public S3 bucket in Python? I
haven't found any.
Thank you,
You can use Spark 1.4 on EMR AMI 3.8.0 if you install Spark as a 3rd party
application using the bootstrap action directly without the native Spark
inclusion with 1.3.1. See
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
Refer to
https://github.com/awslabs/emr-bootstrap
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.
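For context, the install-spark bootstrap action was passed at cluster creation roughly like this (a hedged sketch; check the awslabs/emr-bootstrap-actions repository for the exact script path and argument syntax):

```shell
aws emr create-cluster \
  --name "Spark 1.4 on AMI 3.8.0" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v,1.4.0.a"]
```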
https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda
Any ideas what version of Spark is underneath?
i.e. is it 1.4? and is SparkR supported on Amazon EMR?
On Wed, Jun 17, 2015 at 12:06 AM, ayan guha guha.a...@gmail.com wrote:
That's great news. Can I assume spark on EMR supports kinesis to hbase
pipeline?
On 17 Jun 2015 05:29, kamatsuoka ken
is currently being
used under the hood, passing -v,1.4.0 in the options is not supported.
Sent from Nine (http://www.9folders.com/)
From: Eugen Cepoi cepoi.eu...@gmail.com
Sent: Jun 17, 2015 6:37 AM
To: Hideyoshi Maeda
Cc: ayan guha;kamatsuoka;user
Subject: Re: Spark on EMR
It looks like
That's great news. Can I assume spark on EMR supports kinesis to hbase
pipeline?
On 17 Jun 2015 05:29, kamatsuoka ken...@gmail.com wrote:
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/
--
View this message in context