Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the Hadoop libraries can coexist in a deployment. The latest Spark (3.5.1) base image contains hadoop-client*-3.3.4.jar. The AWS v2 SDK is only supported in hadoop*-3.4.0.jar and onward. The Iceberg AWS integration states the AWS v2 SDK is required<ht

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Swapping out the iceberg-aws-bundle for the very latest aws provided sdk ('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a slightly different code path: java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
[sorry; replying all this time] With hadoop-*-3.3.6 in place of the 3.4.0 below I get java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException. I think that the iceberg-aws-bundle version below supplies the v2 SDK. Dan From: Aaron Grubb Sent: 03

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x; Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subj

[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Hi all, I've struggled with this for quite some time. My requirement is to read a Parquet file from S3 into a DataFrame, then append to an existing Iceberg table. In order to read the Parquet I need the hadoop-aws dependency for s3a://. In order to write to Iceberg I need the iceberg dependency
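
For anyone hitting the same wall: the two SDKs can in fact coexist, because Hadoop's s3a connector uses the v1 SDK (com.amazonaws) while Iceberg's S3FileIO uses the v2 SDK (software.amazon.awssdk); the packages differ, so they do not clash on the classpath. A minimal, untested sketch under that assumption (versions, bucket, and table names are illustrative; hadoop-aws must match the image's hadoop-client version exactly):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("parquet-to-iceberg")
        # hadoop-aws 3.3.4 matches the Spark 3.5.1 image's hadoop-client jars;
        # iceberg-aws-bundle supplies the v2 SDK that Iceberg's S3FileIO needs.
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-aws:3.3.4,"
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
                "org.apache.iceberg:iceberg-aws-bundle:1.5.0")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://my-bucket/incoming/")  # read path: hadoop-aws / s3a
    df.writeTo("demo.db.events").append()                 # write path: Iceberg S3FileIO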

Re: automatically/dynamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
I'm trying to argue for that path. By now even requesting an increase in the session duration is a struggle, but at the moment, since I was only allowed the AssumeRole approach, I'm figuring out a way through this path. > https://github.com/zillow/aws-custom-credential-provider Thank you Pol. I'll take

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Pol Santamaria
Hi Carlos! Take a look at this project, it's 6 years old but the approach is still valid: https://github.com/zillow/aws-custom-credential-provider The credential provider gets called each time an S3 or Glue Catalog is accessed, and then you can decide whether to use a cached token or renew
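
For S3A access specifically, Hadoop 3.1+ also ships a built-in renewing option. A hedged spark-defaults.conf sketch (the role ARN is a placeholder, the custom class name is hypothetical, and none of this is tested on Glue's managed Spark):

    # Option A: S3A's built-in assumed-role provider, which renews its own
    # STS session credentials (Hadoop 3.1+).
    spark.hadoop.fs.s3a.aws.credentials.provider  org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
    spark.hadoop.fs.s3a.assumed.role.arn          arn:aws:iam::123456789012:role/cross-account-role
    # Option B: a custom provider class modeled on the zillow project above
    # (com.example.RenewingCredentialsProvider is a hypothetical name):
    # spark.hadoop.fs.s3a.aws.credentials.provider  com.example.RenewingCredentialsProvider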

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross-account permission to the Glue job role? Why the detour via AssumeRole? AssumeRole can make sense if you use an AWS IAM user and STS authentication, but this would make no sense within AWS for cross-account access, as attaching the permissions to the Glue job role

automatically/dynamically renew aws temporary token

2023-10-22 Thread Carlos Aguni
node? I'm currently using Spark on AWS Glue and wonder what options I have. Regards, C.

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the zipped Python file and the runner .py file onto S3 for AWS EKS to work. It is a bit of a nightmare compared to the same on Google SDK, which is simpler. Anyhow, you will require additional jar files to be added to $SPARK_HOME/jars. These two files will be picked up after you build

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the docker container's folders. It will work if you are using s3 or gs. Wed, 12 Apr 2023, 18:02 Mich Talebzadeh wrote: > Hi, > > In my spark-submit to eks cluster, I use the standard code to submit to > the cluster as below: > > spark-submit --verbose \ >--master

Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Hi, In my spark-submit to eks cluster, I use the standard code to submit to the cluster as below: spark-submit --verbose \ --master k8s://$KUBERNETES_MASTER_IP:443 \ --deploy-mode cluster \ --name sparkOnEks \ --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
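
On k8s, local:// resolves inside the container image, not on the submitting machine, which is the usual stumbling block here. A hedged sketch of the alternative the thread converges on (bucket names and the upload path are placeholders; s3a:// URLs require hadoop-aws and the matching AWS SDK jar in $SPARK_HOME/jars):

    spark-submit --verbose \
      --master k8s://$KUBERNETES_MASTER_IP:443 \
      --deploy-mode cluster \
      --name sparkOnEks \
      --conf spark.kubernetes.container.image=$SPARK_IMAGE \
      --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
      --py-files s3a://my-bucket/code/spark_on_eks.zip \
      s3a://my-bucket/code/main.py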

Re: [Spark Structured Streaming] Does Spark Structured Streaming currently support a sink to AWS Kinesis, and how to handle hitting Kinesis quotas?

2023-03-06 Thread Mich Talebzadeh
ut it seems like it does not have a corresponding > connector we can use. I would like to confirm whether there is another method in addition > to this solution > <https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark> > 2. Because AWS Kinesis has quota limita

[Spark Structured Streaming] Does Spark Structured Streaming currently support a sink to AWS Kinesis, and how to handle hitting Kinesis quotas?

2023-03-05 Thread hueiyuan su
whether there is another method in addition to this solution <https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark> 2. Because AWS Kinesis has quota limitations (like 1 MB/s and 1000 records/s), if the Spark Structured Streaming micro-batch size is too large, h

Re: [Spark Structured Streaming] Does Spark Structured Streaming currently support a sink to AWS Kinesis?

2023-02-16 Thread Vikas Kumar
ing > *Level*: Advanced > *Scenario*: How-to > > > *Problems Description* > I would like to implement writeStream data to AWS Kinesis with Spark > Structured Streaming, but I do not find a related connector jar that can be used. > I want to check whether fully

[Spark Structured Streaming] Does Spark Structured Streaming currently support a sink to AWS Kinesis?

2023-02-16 Thread hueiyuan su
*Component*: Spark Structured Streaming *Level*: Advanced *Scenario*: How-to *Problems Description* I would like to implement writeStream data to AWS Kinesis with Spark Structured Streaming, but I cannot find a related connector jar that can be used. I want to check whether fully
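
Absent a first-party sink, the common workaround is foreachBatch plus the AWS SDK, which also gives a natural place to respect the quotas in question. A hedged Python sketch (stream, region, and the "key" column are placeholders; it pulls each micro-batch through the driver, so it only suits modest batch sizes):

    import json
    import boto3

    def write_to_kinesis(batch_df, batch_id):
        client = boto3.client("kinesis", region_name="us-east-1")
        records = [
            {"Data": json.dumps(row.asDict()).encode("utf-8"),
             "PartitionKey": str(row["key"])}
            for row in batch_df.toLocalIterator()
        ]
        # PutRecords accepts at most 500 records per call; per-shard quotas
        # (1 MB/s, 1000 records/s) still require retrying failed records.
        for i in range(0, len(records), 500):
            resp = client.put_records(StreamName="my-stream",
                                      Records=records[i:i + 500])
            if resp["FailedRecordCount"] > 0:
                pass  # real code should retry with backoff

    query = (df.writeStream
               .foreachBatch(write_to_kinesis)
               .option("checkpointLocation", "s3a://my-bucket/chk/kinesis")
               .start())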

Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Sid
Where can I find information on the size of the datasets supported by AWS Glue? I didn't see it in the documentation. Also, if I want to process TBs of data, e.g. 1 TB, what should be the ideal EMR cluster configuration? Could you please guide me on this? Thanks, Sid. On Thu, 23 Jun 2022, 23:44

Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Gourav Sengupta
Please use EMR, Glue is not made for heavy processing jobs. On Thu, Jun 23, 2022 at 6:36 AM Sid wrote: > Hi Team, > > Could anyone help me in the below problem: > > > https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-p

Need help with the configuration for AWS glue jobs

2022-06-22 Thread Sid
Hi Team, Could anyone help me in the below problem: https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data Thanks, Sid

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Gourav Sengupta
arquet false > > see > https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45 > > On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote: > > Hi, > > > > I received a response from AWS, this is an issue with EMR, and they are

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Nicolas Paris
as a workaround turn off pruning: spark.sql.hive.metastorePartitionPruning false spark.sql.hive.convertMetastoreParquet false see https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45 On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote: > Hi,
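
In spark-defaults.conf form (the same pair can also go on spark-submit via --conf); note this trades away partition pruning, so Spark will list all partitions:

    spark.sql.hive.metastorePartitionPruning  false
    spark.sql.hive.convertMetastoreParquet    false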

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-24 Thread Gourav Sengupta
Hi, I received a response from AWS, this is an issue with EMR, and they are working on resolving the issue I believe. Thanks and Regards, Gourav Sengupta On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta < gourav.sengupta.develo...@gmail.com> wrote: > Hi, > > the query still gives

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi, the query still gives the same error if we write "SELECT * FROM table_name WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS". Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0. Thanks and Regards, Gourav Sengupta On Mon, Aug 23, 2021 at 1:16 PM Sean Owen wrote: > Date

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Sean Owen
Date handling was tightened up in Spark 3. I think you need to compare to a date literal, not a string literal. On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta < gourav.sengupta.develo...@gmail.com> wrote: > Hi, > > while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query such as "SELECT * > FROM
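
Concretely, with the table and column names quoted earlier in the thread, the comparison Sean suggests would look like:

    SELECT * FROM table_name WHERE data_partition > DATE '2021-03-01';
    -- or, equivalently:
    SELECT * FROM table_name WHERE data_partition > CAST('2021-03-01' AS DATE);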

AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi, while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query such as "SELECT * FROM table_name WHERE data_partition > '2021-03-01'", the query is failing with error: pyspark.sql.utils.AnalysisException:

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone! I am sharing this article about running Spark / Presto workloads on AWS: Bursting On-Premise Datalake Analytics and AI Workloads on AWS <https://bit.ly/3qA1Tom> published on AWS blog. Hope you enjoy it. Feel free to discuss with me here <https://alluxio.io/slack>. - Bin

Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
Dear Spark users, if you are interested in running Spark in a hybrid cloud, check out talks from AWS & GCP at the virtual Data Orchestration Summit <https://www.alluxio.io/data-orchestration-summit-2020/> on Dec. 8-9, 2020; register for free <https://www.alluxio.io/data-orchestratio

Re: Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2020-07-22 Thread Shriraj Bhardwaj
We faced a similar situation with JRE 8u262; try reverting back... On Thu, Jul 23, 2020, 5:18 AM koti reddy wrote: > Hi, > > Can someone help to resolve this issue? > Thank you in advance. > > Error logs : > > java.io.EOFException: Unexpected EOF while trying to read response from server >

Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2020-07-22 Thread koti reddy
Hi, Can someone help to resolve this issue? Thank you in advance. Error logs : java.io.EOFException: Unexpected EOF while trying to read response from server at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402) at

AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony
I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR vs, say, Databricks. I realize that if I were able to use Hadoop 3.1 it would be much more performant, because it has a high-performance output committer. Is this the case, and if so, when will
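
For reference, the Hadoop 3.1+ committers alluded to here are the S3A committers, which matter when the slow writes target S3 rather than HDFS. A hedged config sketch of enabling the "magic" committer (requires Spark's hadoop-cloud module on the classpath):

    spark.hadoop.fs.s3a.committer.name           magic
    spark.hadoop.fs.s3a.committer.magic.enabled  true
    spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter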

Re: Aws

2019-02-08 Thread Pedro Tuero
same link, it says that dynamic allocation is true by default. I >>> thought it would do the trick but reading again I think it is related to >>> the number of executors rather than the number of cores. >>> >>> But the jobs are still taking more than before. >&g

Re: Aws

2019-02-07 Thread Noritaka Sekiyama
t it would do the trick but reading again I think it is related to >> the number of executors rather than the number of cores. >> >> But the jobs are still taking more than before. >> Watching application history, I see these differences: >> For the same job, the same kind of

Re: Aws

2019-02-07 Thread Hiroyuki Nagata
e number of executors rather than the number of cores. > > But the jobs are still taking more than before. > Watching application history, I see these differences: > For the same job, the same kind of instances types, default (aws managed) > configuration for executors, cores, and mem

Re: Aws

2019-02-01 Thread Pedro Tuero
application history, I see these differences: For the same job, the same kind of instance types, default (AWS-managed) configuration for executors, cores, and memory: Instances: 6 r5.xlarge: 4 vCPU, 32 GB of mem. (So there are 24 cores: 6 instances * 4 cores). With 5.16: - 24 executors (4 in each

Re: Aws

2019-01-31 Thread Hiroyuki Nagata
Hi, Pedro. I also started using AWS EMR, with Spark 2.4.0, and I'm seeking methods for performance tuning. Do you configure dynamic allocation? FYI: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation I've not tested it yet. I guess spark-submit needs to specify
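
The settings the linked docs describe, as a hedged spark-submit sketch (on YARN, dynamic allocation also needs the external shuffle service on each node; the executor bounds are illustrative):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      my_job.py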

Aws

2019-01-31 Thread Pedro Tuero
Hi guys, I run Spark jobs on AWS EMR. Recently I switched from EMR label 5.16 to 5.20 (which uses Spark 2.4.0). I've noticed that a lot of steps are taking longer than before. I think it is related to the automatic configuration of cores per executor. In version 5.16, some executors took

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Riccardo Ferrari
, On Fri, Dec 21, 2018 at 1:18 PM Aakash Basu wrote: > Any help, anyone? > > On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu > wrote: > >> Hey Shuporno, >> >> With the updated config too, I am getting the same error. While trying to >> figure that out, I found t

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Any help, anyone? On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu wrote: > Hey Shuporno, > > With the updated config too, I am getting the same error. While trying to > figure that out, I found this link which says I need aws-java-sdk (which I > already have): > https://github.c

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Hey Shuporno, With the updated config too, I am getting the same error. While trying to figure that out, I found this link which says I need aws-java-sdk (which I already have): https://github.com/amazon-archives/kinesis-storm-spout/issues/8 Now, this is my java details: java version "1.8.

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Shuporno Choudhury
> Thanks, > Aakash. > > On Fri, Dec 21, 2018 at 12:51 PM Shuporno Choudhury <[hidden email]> wrote: > >> >> On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <[hidden email]

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
ec 21, 2018 at 12:51 PM Shuporno Choudhury < shuporno.choudh...@gmail.com> wrote: > > > On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury < > shuporno.choudh...@gmail.com> wrote: > >> Hi, >> Your connection config uses 's3n' but your read command uses 's3a'

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Shuporno Choudhury
is should solve the problem. > > On Fri, 21 Dec 2018 at 12:09, Aakash Basu-2 [via Apache Spark User List] < > ml+s1001560n34215...@n3.nabble.com> wrote: > >> Hi, >> >> I am trying to connect to AWS S3 and read a csv file (running POC) from a >> bucket. >

Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hi, I am trying to connect to AWS S3 and read a csv file (running a POC) from a bucket. I have s3cmd and am able to run ls and other operations from the CLI. *Present Configuration:* Python 3.7 Spark 2.3.1 *JARs added:* hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark) aws
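
The resolution later in the thread is to use one scheme consistently. A hedged sketch for this stack (Spark 2.3.1 with hadoop-aws-2.7.3, whose matching SDK is aws-java-sdk-1.7.4; keys and bucket are placeholders, and IAM roles beat embedded keys wherever possible):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-poc").getOrCreate()
    # _jsc is Spark's internal JavaSparkContext handle; commonly used, not a public API
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", "ACCESS_KEY")   # same scheme (s3a) in the config...
    hconf.set("fs.s3a.secret.key", "SECRET_KEY")

    df = spark.read.option("header", "true").csv("s3a://my-bucket/data.csv")  # ...and in the path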

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
t; Please consider the environment before printing. > > > > > > > > *From: *Li Gao > *Date: *Thursday, November 1, 2018 4:56 > *To: *"Zhang, Yuqi" > *Cc: *Gourav Sengupta , "user@spark.apache.org" > , "Nogami, Masatsugu" > > *S

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
ts. Instead, please notify the sender and delete the e-mail and any attachments. Thank you. Please consider the environment before printing. From: Li Gao Date: Thursday, November 1, 2018 4:56 To: "Zhang, Yuqi" Cc: Gourav Sengupta , "user@spark.apache.org" , "Nogami, Masa

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
ts. Instead, please > notify the sender and delete the e-mail and any attachments. Thank you. > > Please consider the environment before printing. > > > > > > > > *From: *Li Gao > *Date: *Thursday, November 1, 2018 0:07 > *To: *"Zhang, Yuqi" > *

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
s. Thank you. Please consider the environment before printing. From: Li Gao Date: Thursday, November 1, 2018 0:07 To: "Zhang, Yuqi" Cc: "gourav.sengu...@gmail.com" , "user@spark.apache.org" , "Nogami, Masatsugu" Subject: Re: [Spark Shell on AWS K8s

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
as the driver client. -Li On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi wrote: > Hi Gourav, > > > > Thank you for your reply. > > > > I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws > instances? > > I could set up the k8s cluster on A

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Gourav, Thank you for your reply. I haven’t tried Glue or EMK, but I guess it’s integrating Kubernetes on AWS instances? I could set up the k8s cluster on AWS, but my problem is I don’t know how to run spark-shell on Kubernetes… Spark only supports client mode on k8s from version 2.4, which

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Biplob Biswas
n on kubernetes cluster, so I >> would like to ask if there is some solution to my problem. >> >> >> >> The problem is when I am trying to run spark-shell on kubernetes v1.11.3 >> cluster on AWS environment, I couldn’t successfully run stateful set using >>

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Gourav Sengupta
g spark 2.4 client mode function on kubernetes cluster, so I > would like to ask if there is some solution to my problem. > > > > The problem is when I am trying to run spark-shell on kubernetes v1.11.3 > cluster on AWS environment, I couldn’t successfully run stateful set using >

[Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-28 Thread Zhang, Yuqi
cluster on an AWS environment, I couldn’t successfully run a stateful set using the docker image built from Spark 2.4. The error message is shown below. The version I am using is spark v2.4.0-rc3. Also, I wonder if there is more documentation on how to use client mode or integrate spark-shell

Re: AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Srinath C
You could use IAM roles in AWS to access the data in S3 without credentials. See this link <https://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_s3.html> and this link <http://parthicloud.com/how-to-access-s3-bucket-from-application-on-amazon-ec2-without-access-cr
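
With an instance profile attached to the EC2/EMR nodes, the S3A side reduces to a single setting; a sketch (the provider class comes from the v1 SDK that pairs with hadoop-aws 2.x):

    spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider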

AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Mina Aslani
Hi, I am trying to load an ML model from AWS S3 in my Spark app running in a docker container; however, I need to pass the AWS credentials. My question is, why do I need to pass the credentials in the path? And what is the workaround? Best regards, Mina

Spark Structured Streaming how to read data from AWS SQS

2017-12-11 Thread Bogdan Cojocar
For Spark Streaming there are connectors that can achieve this functionality. Unfortunately, for Spark Structured Streaming I couldn't find any, as it's a newer technology. Is there a way to connect to a source using a Spark Streaming connector? Or is

Re: Quick one... AWS SDK version?

2017-10-08 Thread Jonathan Kelly
Tushar, Yes, the hadoop-aws jar installed on an emr-5.8.0 cluster was built with AWS Java SDK 1.11.160, if that’s what you mean. ~ Jonathan On Sun, Oct 8, 2017 at 8:42 AM Tushar Sudake <etusha...@gmail.com> wrote: > Hi Jonathan, > > Does that mean Hadoop-AWS 2.7.3 too is built

Re: Quick one... AWS SDK version?

2017-10-08 Thread Tushar Sudake
Hi Jonathan, Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and not 1.7.4? Thanks. On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" <j...@jgp.net> wrote: Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and they use s

Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin
Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and they use some AWS API from the Amazon SDK, whose version is like from yesterday :) so it’s confusing, and AMZ is changing its version like crazy, so it’s a little difficult to follow. Right now I went back

Re: Quick one... AWS SDK version?

2017-10-07 Thread Marco Mistroni
Hi JG, out of curiosity, what's your use case? Are you writing to S3? You could use Spark to do that, e.g. using the Hadoop package org.apache.hadoop:hadoop-aws:2.7.1; that will download the AWS client which is in line with Hadoop 2.7.1. HTH, Marco. On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly
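
Marco's suggestion as a concrete command (the version is whatever matches your Hadoop line; for Hadoop 2.7.x the transitive AWS SDK resolves to 1.7.4):

    spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 my_job.py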

Re: Quick one... AWS SDK version?

2017-10-06 Thread Jonathan Kelly
Note: EMR builds Hadoop, Spark, et al, from source against specific versions of certain packages like the AWS Java SDK, httpclient/core, Jackson, etc., sometimes requiring some patches in these applications in order to work with versions of these dependencies that differ from what the applications

Re: Quick one... AWS SDK version?

2017-10-04 Thread Steve Loughran
On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote: Sorry Steve – I may not have been very clear: thinking about aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled with Spark. I know, but if you are talking to s3

RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Sorry Steve - I may not have been very clear: thinking about aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled with Spark. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, October 03, 2017 2:20 PM To: JG Perrin <jper...@lumeris.com> Cc

RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Thanks Yash… this is helpful! From: Yash Sharma [mailto:yash...@gmail.com] Sent: Tuesday, October 03, 2017 1:02 AM To: JG Perrin <jper...@lumeris.com>; user@spark.apache.org Subject: Re: Quick one... AWS SDK version? Hi JG, Here are my cluster configs if it helps. Cheers. EMR: emr

Re: Quick one... AWS SDK version?

2017-10-03 Thread Steve Loughran
On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote: Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? You generally have to stick with the version which Hadoop was built with

Re: Quick one... AWS SDK version?

2017-10-03 Thread Yash Sharma
Hi JG, Here are my cluster configs if it helps. Cheers. EMR: emr-5.8.0 Hadoop distribution: Amazon 2.7.3 AWS sdk: /usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.160.jar Applications: Hive 2.3.0 Spark 2.2.0 Tez 0.8.4 On Tue, 3 Oct 2017 at 12:29 JG Perrin <jper...@lumeris.com> wrote:

Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? Thanks! jg

Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES Connector, but am not able to. I am passing es.nodes and es.port along with es.nodes.wan.only set to true, but it fails with the error below: 34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x failed to respond
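
For what it's worth, AWS-managed Elasticsearch only exposes a single HTTPS endpoint on 443, so elasticsearch-hadoop has to run in WAN-only mode with SSL enabled. A hedged write sketch (endpoint and index names are placeholders):

    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "vpc-my-domain.us-east-1.es.amazonaws.com")
       .option("es.port", "443")
       .option("es.nodes.wan.only", "true")   # don't try to reach individual data nodes
       .option("es.net.ssl", "true")
       .save("my-index/doc"))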

Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
t 4:59 PM, Takashi Sasaki <tsasaki...@gmail.com> > wrote: >> >> Hi Pascal, >> >> The error also occurred frequently in our project. >> >> As a solution, it was effective to specify the memory size directly >> with spark-submit command. >> >>

Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Josh Holbrook
it was effective to specify the memory size directly > with the spark-submit command. > > e.g. spark-submit --executor-memory 2g > > > Regards, > > Takashi > > > 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>: > >> Hi, > >> > >>

Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
spark-submit command. > > e.g. spark-submit --executor-memory 2g > > > Regards, > > Takashi > >> 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>: >>> Hi, >>> >>> I am running a Spark 2.1.x Application on

Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
t;: >> Hi, >> >> I am running a Spark 2.1.x Application on AWS EMR with YARN and get >> following error that kill my application: >> >> AM Container for appattempt_1500320286695_0001_01 exited with exitCode: >> -104 >> For more detailed output, chec

Running Spark und YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
Hi, I am running a Spark 2.1.x application on AWS EMR with YARN and get the following error that kills my application: AM Container for appattempt_1500320286695_0001_01 exited with exitCode: -104 For more detailed output, check the application tracking page: http://ip-172-31-35-192.eu-central-1
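
Exit code -104 is YARN killing the container for exceeding its memory allocation, which is why the fix in Takashi's reply works. A hedged sketch for Spark 2.1 on EMR (values are illustrative; the overhead property was later renamed spark.executor.memoryOverhead):

    spark-submit \
      --executor-memory 2g \
      --conf spark.yarn.executor.memoryOverhead=512 \
      my_app.jar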

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread lucas.g...@gmail.com
"Building data products is a very different discipline from that of building software." That is a fundamentally incorrect assumption. There will always be a need for figuring out how to apply said principles, but saying 'we're different' has always turned out to be incorrect and I have seen no

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 12 Apr 2017, at 17:25, Gourav Sengupta > wrote: Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Gourav Sengupta
Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a color of a button in green on the screen. Perhaps it may be pertinent to read the first

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 11 Apr 2017, at 20:46, Gourav Sengupta > wrote: And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It genuinely a pain to see this happen. While I'm

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sumona Routh
ou can do some build test before even > submitting it to a remote cluster > > On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote: > > Hi Shyla > > You have multiple options really some of which have been already listed > but let me try and clarify > >

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Gourav Sengupta
set up for some CI workflow, that can do scheduled >>>> builds and tests. Works well if you can do some build test before even >>>> submitting it to a remote cluster >>>> >>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wr

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sam Elamin
;>> >>> Hi Shyla >>> >>> You have multiple options really some of which have been already listed >>> but let me try and clarify >>> >>> Assuming you have a spark application in a jar you have a variety of >>> options >>> >

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Steve Loughran
ing spark cluster that is either running on EMR or somewhere else. Super simple / hacky: Cron job on EC2 that calls a simple shell script that does a spark-submit to a Spark cluster OR create or add a step to an EMR cluster. More Elegant: Airflow/Luigi/AWS Data Pipeline (which is just CRON in th

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
unning on EMR >> or somewhere else. >> >> *Super simple / hacky* >> Cron job on EC2 that calls a simple shell script that does a spark-submit >> to a Spark Cluster OR create or add step to an EMR cluster >> >> *More Elegant* >> Airflow/Luigi/AWS Data Pipel

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Gourav Sengupta
hat calls a simple shell script that does a spark-submit > to a Spark Cluster OR create or add step to an EMR cluster > > *More Elegant* > Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will > do the above step but have scheduling and potential backfill

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
cky Cron job on EC2 that calls a simple shell script that does a spark-submit to a Spark Cluster OR create or add step to an EMR cluster More Elegant Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do the above step but have scheduling and potential backfilling and err

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
/ hacky* Cron job on EC2 that calls a simple shell script that does a spark-submit to a Spark Cluster OR create or add step to an EMR cluster *More Elegant* Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do the above step but have scheduling and potential backfilling
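
The "super simple / hacky" option in concrete form, as a hedged sketch (paths, schedule, and master URL are placeholders):

    # crontab entry: run hourly, append output to a log
    0 * * * * /opt/scripts/run_batch.sh >> /var/log/spark_batch.log 2>&1

    # /opt/scripts/run_batch.sh
    #!/usr/bin/env bash
    spark-submit \
      --master spark://my-cluster:7077 \
      --deploy-mode cluster \
      /opt/jobs/hourly_batch.py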

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Gourav Sengupta
Hi Shyla, why would you want to schedule a spark job in EC2 instead of EMR? Regards, Gourav On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com> wrote: > I want to run a spark batch job maybe hourly on AWS EC2 . What is the > easiest way to do this. Thanks >

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Yash Sharma
On Fri, 7 Apr 2017 at 10:04 shyla deshpande <deshpandesh...@gmail.com> wrote: > I want to run a spark batch job maybe hourly on AWS EC2 . What is the > easiest way to do this. Thanks >

What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread shyla deshpande
I want to run a spark batch job maybe hourly on AWS EC2 . What is the easiest way to do this. Thanks

Consuming AWS Cloudwatch logs from Kinesis into Spark

2017-04-05 Thread Tim Smith
I am sharing this code snippet since I spent quite some time figuring it out and couldn't find any examples online. Between the Kinesis documentation, the tutorial on the AWS site, and other code snippets on the Internet, I was confused about the structure/format of the messages that Spark fetches from
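
The core surprise is that CloudWatch Logs delivers records to Kinesis as gzip-compressed JSON, so the payload must be decompressed before parsing. A hedged PySpark sketch, not the original poster's snippet (names and region are placeholders; it needs the spark-streaming-kinesis-asl package, assumes an existing SparkContext sc, and assumes the createStream decoder hook may return raw bytes):

    import json
    import zlib
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    ssc = StreamingContext(sc, 10)
    stream = KinesisUtils.createStream(
        ssc, "cw-logs-app", "my-log-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 10,
        decoder=lambda b: b)  # keep raw bytes; the default decoder assumes UTF-8

    # 47 = 32 + 15: tell zlib to auto-detect the gzip header
    events = stream.map(lambda b: json.loads(zlib.decompress(b, 47)))
    events.pprint()
    ssc.start()
    ssc.awaitTermination()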

Spark is inventing its own AWS secret key

2017-03-08 Thread Jonhy Stack
a.secret.key", "SECRETKEY") hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") logs = spark_context.textFile("s3a://mybucket/logs/*) Spark was saying Invalid Access key [ACCESSKEY] However with the same ACCESSKEY and SECRETKEY this was workin

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Prithish
nded there just now. See http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161 In short, an hdfs:// path can't be used to configure log4j because log4j knows nothing about hdfs. Instead, since you are using EMR, you should us

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Jonathan Kelly
Prithish, I saw you posted this on SO, so I responded there just now. See http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161 In short, an hdfs:// path can't be used to configure log4j because log4j knows nothing about hdfs. Instead, since you
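
The EMR route Jonathan describes is a Configurations entry supplied at cluster creation; a hedged sketch of its shape (the logger names and levels are placeholders):

    [
      {
        "Classification": "spark-log4j",
        "Properties": {
          "log4j.rootCategory": "WARN, console",
          "log4j.logger.com.example.myapp": "DEBUG"
        }
      }
    ]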

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
rking when > running on my local Yarn setup. > > Any ideas? > > I have also posted on Stackoverflow (link below) > http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr > >

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Steve Loughran
ad of hdfs. None of this seems to work. However, I can get this working when running on my local Yarn setup. Any ideas? I have also posted on Stackoverflow (link below) http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr

Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
. However, I can get this working when running on my local Yarn setup. Any ideas? I have also posted on Stackoverflow (link below) http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr

Spark streaming on AWS EC2 error . Please help

2017-02-20 Thread shyla deshpande
I am running Spark Streaming on AWS EC2 in standalone mode. When I do a spark-submit, I get the following message. I am subscribing to 3 Kafka topics and it is reading and processing just 2 topics. It works fine in local mode. Appreciate your help. Thanks. Exception in thread "pool-26-threa

Re: Spark Read from Google store and save in AWS s3

2017-01-10 Thread A Shaikh
t;) > > spark = SparkSession.builder\ > .config(conf=sc.getConf())\ > .getOrCreate() > > dfTermRaw = spark.read.format("csv")\ > .option("header", "true")\ > .option("delimiter"

Re: Spark Read from Google store and save in AWS s3

2017-01-07 Thread neil90
spark = SparkSession.builder\ .config(conf=sc.getConf())\ .getOrCreate() dfTermRaw = spark.read.format("csv")\ .option("header", "true")\ .option("delimiter" ,"\t")\ .option("inferSchema", "true")\

Re: Spark Read from Google store and save in AWS s3

2017-01-06 Thread Steve Loughran
On 5 Jan 2017, at 20:07, Manohar Reddy > wrote: Hi Steve, Thanks for the reply; below is the follow-up help needed from you. Do you mean we can set up two native file systems on a single SparkContext, so that based on URLs
