Re: Help with Shuffle Read performance

2022-09-29 Thread Vladimir Prus
e, but not enough to saturate the cluster resources. > > Did I miss some more tuning parameters that could help? > One obvious thing would be to scale the machines up vertically and use > fewer nodes to minimize traffic, but 30 nodes doesn't seem like much, even > considering 30x30 connections. > > Thanks in advance! > > -- Vladimir Prus http://vladimirprus.com
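The thread above is about shuffle-read performance on a ~30-node cluster. As a minimal sketch (the exact settings used in the thread are not shown; the values below are illustrative assumptions, not recommendations from the thread), these are the shuffle-related Spark properties most commonly adjusted in this situation:

```properties
# Fewer, larger partitions reduce the number of shuffle fetches (value is illustrative).
spark.sql.shuffle.partitions        400
# Larger in-flight fetch size per reduce task.
spark.reducer.maxSizeInFlight       96m
# More parallel connections between each pair of nodes (relevant to the 30x30 concern).
spark.shuffle.io.numConnectionsPerPeer  2
# Bigger buffer for shuffle spill files.
spark.shuffle.file.buffer           64k
```

These go in spark-defaults.conf or as `--conf` flags; whether they help depends on whether the bottleneck is network, disk, or partition count.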

Is BindingParquetOutputCommitter still used?

2021-09-08 Thread Vladimir Prus
ormat, where it is copied to spark.sql.sources.outputCommitterClass, and that option, in turn, is only used by SQLHadoopMapReduceCommitProtocol - which we don't use here. So, it sounds like setting parquet.output.committer.class to org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter is no longer necessary? Or is there some code path where it matters? -- Vladimir Prus http://vladimirprus.com
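For context on the question above: the Spark cloud-integration documentation describes wiring the cloud committers through three settings, one of which is the BindingParquetOutputCommitter the message asks about. A sketch of that documented setup (assuming the spark-hadoop-cloud module is on the classpath):

```properties
# Select the S3A committer implementation.
spark.hadoop.fs.s3a.committer.name            directory
# Route Spark SQL commits through the path-output commit protocol.
spark.sql.sources.commitProtocolClass         org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
# Bind Parquet's committer to the same mechanism.
spark.sql.parquet.output.committer.class      org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The message argues the last line may be redundant on the code path it describes; this fragment only shows the configuration as documented, not an answer to that question.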

Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
tion.maximum config param from 200 to 400 or 900 >> but it didn't reduce the S3 latency. >> >> Do you have any idea for the cause of the read latency from S3? >> >> I saw this post >> <https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/> >> to >> improve the transfer speed, is something here relevant? >> >> >> Thanks, >> Tzahi >> > -- Vladimir Prus http://vladimirprus.com
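The thread above mentions raising the S3A connection pool without improving latency. A sketch of other S3A options that are often tuned for Spark reads over S3 (values are illustrative assumptions, not taken from the thread):

```properties
# Connection pool size discussed in the thread.
spark.hadoop.fs.s3a.connection.maximum      400
# Random-access read policy suits columnar formats like Parquet/ORC;
# the default sequential policy re-opens the stream on every seek.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
# Readahead window for forward seeks.
spark.hadoop.fs.s3a.readahead.range         256K
```

For columnar data, the fadvise setting is frequently the one that matters most, since latency per request dominates over bandwidth.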

Re: Issue with accessing S3 from EKS spark pod

2021-02-10 Thread Vladimir Prus
> On Tue, Feb 9, 2021 at 10:44 PM Vladimir Prus wrote: > >>

Re: Issue with accessing S3 from EKS spark pod

2021-02-09 Thread Vladimir Prus
On 9 Feb 2021, at 19:46, Rishabh Jain wrote: Hi, We are trying to access S3 from a Spark job running on an EKS cluster pod. I have a service account with an IAM role attached that grants full S3 permissions. We are using DefaultCredentialsProviderChain, but we are still getting 403 Forbidden from S3.
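A common cause of this symptom with EKS service-account roles (IRSA) is that the AWS SDK version in use does not resolve the web-identity token, so the chain falls back to a different identity. A hedged configuration sketch, assuming an SDK recent enough to support web-identity tokens:

```properties
# Use the web-identity (IRSA) credentials provider explicitly instead of
# relying on the default chain's ordering.
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.WebIdentityTokenCredentialsProvider
```

The AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables injected into the driver pod must also reach the executor pods for this to work; whether that is the issue in this particular thread is not confirmed by the snippet.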

Re: High level explanation of dropDuplicates

2019-06-12 Thread Vladimir Prus
Hi, If your data frame is partitioned by column A, and you want deduplication by columns A, B and C, then a faster way might be to sort each partition by A, B and C and then do a linear scan - that is often faster than grouping by all columns, which requires a shuffle. Sadly, there's no standard way
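The per-partition idea above can be sketched in plain Python (not the Spark API; in Spark this would typically be `sortWithinPartitions` followed by `mapPartitions`; the function and field names below are illustrative):

```python
from itertools import groupby

def dedup_sorted_partition(rows, key):
    """Sort one partition by the dedup key, then keep the first row of each
    run of equal keys -- a linear scan, with no shuffle across partitions."""
    out = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        out.append(next(group))  # first row of each duplicate run
    return out

# Toy partition already co-located by column A; dedup on (A, B, C).
partition = [
    {"A": 1, "B": "x", "C": 10, "payload": "first"},
    {"A": 1, "B": "x", "C": 10, "payload": "dup"},
    {"A": 1, "B": "y", "C": 20, "payload": "kept"},
]
unique = dedup_sorted_partition(partition, key=lambda r: (r["A"], r["B"], r["C"]))
print(len(unique))  # 2
```

This only matches `dropDuplicates` semantics when every duplicate group lives entirely inside one partition, which is exactly the condition the message describes (data already partitioned by A).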

State of datasource api v2

2019-01-14 Thread Vladimir Prus
nd whether these limitations above will be fixed. Thanks in advance, -- Vladimir Prus http://vladimirprus.com