Compact them and remove the small files.
One messy way of doing this (including some cleanup) looks like the following,
based on rdd.mapPartitions() over an RDD of the file URLs:
import gzip
import json
import logging
import math
import re
from typing import List
import boto3
from mypy_boto3_s3.client import S3Client
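The archived snippet cuts off after the imports, so here is a rough reconstruction of the idea rather than the original code: group the keys of the small gzipped JSON files by directory on the driver, then let each mapPartitions task compact one directory into a single object and delete the originals. The bucket name, prefix, file format and compacted-file naming are all assumptions.

import gzip
from typing import Iterator, List, Tuple

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

BUCKET = "my-bucket"   # hypothetical bucket
PREFIX = "data/"       # hypothetical prefix holding the small files


def list_keys_by_dir(bucket: str, prefix: str) -> List[Tuple[str, List[str]]]:
    """Runs on the driver: list object keys and group them by 'directory'."""
    s3 = boto3.client("s3")
    groups: dict = {}
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            dir_name = obj["Key"].rsplit("/", 1)[0]
            groups.setdefault(dir_name, []).append(obj["Key"])
    return list(groups.items())


def compact_dirs(rows: Iterator[Tuple[str, List[str]]]) -> Iterator[str]:
    """Runs on the executors: merge each directory's files into one object,
    then delete the originals (delete_objects takes at most 1000 keys per call)."""
    s3 = boto3.client("s3")   # one client per partition
    for dir_name, keys in rows:
        merged = b"".join(
            gzip.decompress(s3.get_object(Bucket=BUCKET, Key=k)["Body"].read())
            for k in keys)
        compacted_key = f"{dir_name}/compacted.json.gz"   # hypothetical naming
        s3.put_object(Bucket=BUCKET, Key=compacted_key, Body=gzip.compress(merged))
        s3.delete_objects(Bucket=BUCKET,
                          Delete={"Objects": [{"Key": k} for k in keys]})
        yield compacted_key


dirs = list_keys_by_dir(BUCKET, PREFIX)
compacted = (spark.sparkContext
             .parallelize(dirs, max(1, len(dirs)))   # roughly one directory per task
             .mapPartitions(compact_dirs)
             .collect())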
Eric, can you do the following:
1. List the files on the driver and parallelize the file names as a DataFrame,
grouped by directory name (along the lines of the sketch above).
2. Compact the files in each directory in a task on the executor.
Alternatively, and more simply, you can just go over the directories in the driver
using a simple
Oh, Tzahi, I misread the metrics in the first reply. It’s about reads indeed,
not writes.
From: Tzahi File
Sent: Wednesday, 7 April 2021 16:02
To: Hariharan
Cc: user
Subject: Re: Spark performance over S3
Hi Hariharan,
Thanks for your reply.
In both cases we are writing the data to S3. The
, if some points are unclear or misleading, please state so.
Thanks,
Boris Litvak
Hi Tzahi,
I don’t know the reasons for that, though I’d check that the fs.s3a implementation
is using multipart uploads, which I assume it does.
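Not from the thread, but for concreteness, a minimal PySpark sketch of the S3A multipart settings one would check; the values are only illustrative and property availability varies with the Hadoop version on the cluster.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-multipart-check")
         # Hadoop S3A settings, passed through as spark.hadoop.* :
         # part size (bytes) for multipart uploads and the threshold above
         # which multipart upload kicks in
         .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
         .config("spark.hadoop.fs.s3a.multipart.threshold", "67108864")
         # buffer outgoing parts on disk (alternatives: array, bytebuffer)
         .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
         .getOrCreate())

# Inspect what the active Hadoop configuration actually resolved to
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3a.multipart.size", "fs.s3a.multipart.threshold",
            "fs.s3a.fast.upload.buffer"):
    print(key, "=", hadoop_conf.get(key))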
I would say that none of the comments in the link are relevant to you, as the
VPC endpoint is more of a security feature than a performance one.
I
P.S.: 3. If fast updates are required, one way would be to capture S3 events and
put the paths, modification dates, etc. of the objects into DynamoDB or your DB of
choice.
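A rough illustration of that idea (mine, not from the thread): a Lambda handler subscribed to S3 ObjectCreated notifications that records each key and its event time in a hypothetical DynamoDB table.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("s3-object-index")   # hypothetical table name


def handler(event, context):
    """Lambda entry point for S3 event notifications."""
    for record in event.get("Records", []):
        s3_info = record["s3"]
        table.put_item(Item={
            "bucket": s3_info["bucket"]["name"],
            "key": s3_info["object"]["key"],
            "event_time": record["eventTime"],
            "size": s3_info["object"].get("size", 0),
        })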
From: Boris Litvak
Sent: Tuesday, 16 March 2021 9:03
To: Ben Kaylor ; Alchemist
Cc: User
Subject: RE: How to make bu
Ben, I’d explore these approaches:
1. To address your problem, I’d set up an inventory for the S3 bucket:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
Then you can list the files from the inventory (see the sketch at the end of this
message). Have not tried this myself. Note that the inventory update
This December AWS announced strong read-after-write consistency for S3
(https://aws.amazon.com/s3/consistency/); are you sure this is your problem?
I think all these S3Guard-like wrappers are irrelevant right now. Please
correct me if I am wrong.
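As a rough sketch of approach 1 (untested, as noted): once the inventory is configured with Parquet output, Spark can read the report directly; the destination path and column names below are assumptions about that configuration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inventory destination; a production job would read only the data
# files referenced by the latest manifest.json rather than the whole prefix.
inventory = spark.read.parquet(
    "s3a://my-inventory-bucket/source-bucket/daily-inventory/data/")

# Typical inventory fields, if they were selected in the inventory configuration
listing = inventory.select("bucket", "key", "size", "last_modified_date")
listing.where("size > 0").show(10, truncate=False)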
From: David Morin
Sent: Tuesday, 2 February 2021 22:26
To: user@spark.apache.org
Subject:
truncate', False).start()
q.awaitTermination()
From: Amit Joshi
Sent: Monday, 18 January 2021 20:22
To: Boris Litvak
Cc: spark-user
Subject: Re: [Spark Structured Streaming] Processing the data path coming from
kafka.
Hi Boris,
Thanks for your code block.
I understood what you are trying to achieve in
17/spark-explode-nested-json-with-array-in-scala
Note that, unless I am missing something, you cannot access the Spark session from
foreach, as that code is not running on the driver.
Please say whether this makes sense, or whether I missed anything.
Boris
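To make that point concrete (my sketch, not code from the thread): with foreachBatch, unlike foreach, the function runs on the driver once per micro-batch, so the Spark session can be used there to read the files whose paths arrive from Kafka. The broker, topic, and output path are made up.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
            .option("subscribe", "file-paths")                   # hypothetical topic
            .load()
            .select(col("value").cast("string").alias("path")))


def process_batch(batch_df, batch_id):
    # batch_df is a plain (non-streaming) DataFrame and this function runs on the
    # driver, so calling spark.read here is fine.
    paths = [row.path for row in batch_df.collect()]
    if paths:
        data = spark.read.json(paths)
        data.write.mode("append").parquet("s3a://my-bucket/output/")   # hypothetical


query = (kafka_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/tmp/file-paths-checkpoint")   # hypothetical
         .start())
query.awaitTermination()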
From: Amit Joshi
Sent: Monday, 18 January 2021 17:10
To: Boris
Hi Amit,
Why won’t you just map()/mapXXX() the kafkaDf with a mapping function that
reads the paths?
Also, do you really have to read the JSON into an additional DataFrame?
Thanks, Boris
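My guess at what that suggestion could look like (not Boris's actual code): a UDF applied to the Kafka DataFrame that fetches each path's content on the executors with boto3, so no Spark session is needed inside the task. The broker, topic, and the "bucket/key" message format are assumptions.

import boto3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

paths_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
            .option("subscribe", "file-paths")                   # hypothetical topic
            .load()
            .select(col("value").cast("string").alias("path")))


@udf(returnType=StringType())
def read_path(path: str) -> str:
    # Runs on the executors; assumes the message value looks like "bucket/key.json"
    bucket, _, key = path.partition("/")
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return body.decode("utf-8")


# contents is still a streaming DataFrame; attach a writeStream sink to run it
contents = paths_df.withColumn("content", read_path(col("path")))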
From: Amit Joshi
Sent: Monday, 18 January 2021 15:04
To: spark-user
Subject: [Spark Structured