RE: Small file problem

2021-06-17 Thread Boris Litvak
Compact them and remove the small files. One messy way of doing this (including some cleanup) looks like the following, based on rdd.mapPartitions() on the file URLs RDD:

import gzip
import json
import logging
import math
import re
from typing import List

import boto3
from mypy_boto3_s3.client
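
A minimal sketch of the compaction pattern described above, assuming gzipped JSON-lines objects; the bucket, prefixes, key list and the compact_partition helper are hypothetical, not the code from the original thread:

import gzip
import json
from typing import Iterator, List

import boto3
from pyspark.sql import SparkSession

BUCKET = "my-bucket"              # assumption: replace with your bucket
COMPACTED_PREFIX = "compacted/"   # assumption: output prefix for merged objects

def compact_partition(keys: Iterator[str]) -> Iterator[str]:
    s3 = boto3.client("s3")       # one client per partition, not per record
    records: List[dict] = []
    originals: List[str] = []
    for key in keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        records.extend(json.loads(line) for line in gzip.decompress(body).splitlines())
        originals.append(key)
    if not records:
        return iter([])
    out_key = COMPACTED_PREFIX + originals[0].replace("/", "_") + ".json.gz"
    payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=out_key, Body=payload)
    # cleanup: drop the small source objects once the compacted one is written
    s3.delete_objects(Bucket=BUCKET,
                      Delete={"Objects": [{"Key": k} for k in originals]})
    return iter([out_key])

spark = SparkSession.builder.getOrCreate()
small_file_keys = ["raw/part-0001.json.gz", "raw/part-0002.json.gz"]  # hypothetical listing
compacted = (spark.sparkContext
             .parallelize(small_file_keys, numSlices=2)
             .mapPartitions(compact_partition)
             .collect())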

RE: Reading parquet files in parallel on the cluster

2021-05-30 Thread Boris Litvak
Eric, can you do the following:
1. List the files on the driver and parallelize the file names as a DataFrame, based on directory name.
2. Compact the files in each directory in a task on the executor.
Alternatively, and more easily, you can just go over the directories in the driver using a simple
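
A short sketch of the easier, driver-side variant suggested above; the paths and the /raw/ to /compacted/ naming are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumption: the per-day directories were already listed somehow (e.g. with boto3)
directories = ["s3a://my-bucket/raw/day=2021-05-28",
               "s3a://my-bucket/raw/day=2021-05-29"]

for src in directories:
    dst = src.replace("/raw/", "/compacted/")
    # each read/write below is itself a distributed job; the loop only
    # serializes the per-directory jobs, not the work inside them
    (spark.read.parquet(src)
          .coalesce(1)                        # target file count per directory
          .write.mode("overwrite").parquet(dst))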

RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It’s about reads indeed, not writes. From: Tzahi File Sent: Wednesday, 7 April 2021 16:02 To: Hariharan Cc: user Subject: Re: Spark performance over S3 Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The

Data Lakes using Spark

2021-04-07 Thread Boris Litvak
, if some points are unclear or misleading, please state so. Thanks, Boris Litvak

RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Hi Tzahi, I don’t know the reasons for that, though I’d check that the fs.s3a implementation is using multipart uploads, which I assume it does. I would say that none of the comments in the link are relevant to you, as the VPC endpoint is more of a security rather than a performance feature. I
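
For reference, a hedged sketch of how the s3a multipart-upload settings could be inspected and set from PySpark; the property names are the standard hadoop-aws keys, the values are illustrative only, and _jsc is internal PySpark API:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.multipart.size", "104857600")       # 100 MB parts (illustrative)
         .config("spark.hadoop.fs.s3a.multipart.threshold", "134217728")  # switch to multipart above ~128 MB
         .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")        # buffer upload parts on local disk
         .getOrCreate())

# verify what the running Hadoop configuration actually resolved to
# (_jsc is internal PySpark API, used here only for a quick check)
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3a.multipart.size", "fs.s3a.multipart.threshold"):
    print(key, "=", hconf.get(key))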

RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
P.S.: 3. If fast updates are required, one way would be capturing S3 events & putting the paths, modification dates, etc. into DynamoDB/your DB of choice. From: Boris Litvak Sent: Tuesday, 16 March 2021 9:03 To: Ben Kaylor ; Alchemist Cc: User Subject: RE: How to make bu
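
A hedged sketch of that event-driven approach: a hypothetical AWS Lambda handler that writes each S3 object notification into a DynamoDB table (table name and attribute names are assumptions):

import boto3

TABLE_NAME = "s3-object-index"   # assumption: a DynamoDB table keyed by bucket/key
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def handler(event, context):
    # S3 notification events carry one or more Records with bucket, key, size and time
    for record in event.get("Records", []):
        s3_info = record["s3"]
        table.put_item(Item={
            "bucket": s3_info["bucket"]["name"],
            "key": s3_info["object"]["key"],
            "size": s3_info["object"].get("size", 0),
            "event_time": record["eventTime"],
        })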

RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
Ben, I’d explore these approaches: 1. To address your problem, I’d set up an inventory for the S3 bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. Then you can list the files from the inventory. I have not tried this myself. Note that the inventory update
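
A rough sketch of reading such an inventory report with Spark instead of listing the bucket; the inventory path and the column order are assumptions, since the actual fields depend on how the inventory was configured:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumption: daily CSV inventory delivered to this destination prefix
inventory_path = "s3a://my-inventory-bucket/my-bucket/daily-inventory/data/*.csv.gz"

listing = (spark.read
           .option("header", "false")
           .csv(inventory_path)
           .toDF("bucket", "key", "size", "last_modified"))  # assumed field order

# e.g. pick only recently modified objects instead of scanning the whole bucket
recent = listing.filter("last_modified >= '2021-03-15'")
recent.select("key").show(truncate=False)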

RE: S3a Committer

2021-02-02 Thread Boris Litvak
This December AWS announced https://aws.amazon.com/s3/consistency/; are you sure this is your problem? I think all these S3Guard-like wrappers are irrelevant right now. Please correct me if I am wrong. From: David Morin Sent: Tuesday, 2 February 2021 22:26 To: user@spark.apache.org Subject:

RE: [Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Boris Litvak
truncate', False).start() q.awaitTermination() From: Amit Joshi Sent: Monday, 18 January 2021 20:22 To: Boris Litvak Cc: spark-user Subject: Re: [Spark Structured Streaming] Processing the data path coming from kafka. Hi Boris, Thanks for your code block. I understood what you are trying to achieve in
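
For context, a hedged reconstruction of the general shape such a query ends with (not the exact code from the thread): a Kafka source streamed to the console sink with truncate disabled. The broker, topic and the spark-sql-kafka dependency are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# requires the spark-sql-kafka connector on the classpath
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # assumption
            .option("subscribe", "file-paths")                 # assumption
            .load()
            .selectExpr("CAST(value AS STRING) AS path"))

q = (kafka_df.writeStream
     .format("console")
     .option("truncate", False)
     .start())
q.awaitTermination()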

RE: [Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Boris Litvak
17/spark-explode-nested-json-with-array-in-scala Note that, unless I am missing something, you cannot access the spark session from foreach, as the code is not running on the driver. Please say whether this makes sense or whether I missed anything. Boris From: Amit Joshi Sent: Monday, 18 January 2021 17:10 To: Boris
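
A small sketch of the usual workaround implied here: foreachBatch runs its function on the driver, where the spark session is available, unlike foreach, which runs on the executors. Topic, broker and output path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

paths_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # assumption
            .option("subscribe", "file-paths")                 # assumption
            .load()
            .selectExpr("CAST(value AS STRING) AS path"))

def process_batch(batch_df, batch_id):
    # this function runs on the driver, so the spark session is usable here;
    # collecting is fine while the per-batch list of paths stays small
    paths = [row.path for row in batch_df.collect()]
    if paths:
        spark.read.json(paths).write.mode("append").parquet("s3a://my-bucket/out/")

q = paths_df.writeStream.foreachBatch(process_batch).start()
q.awaitTermination()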

RE: [Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Boris Litvak
Hi Amit, Why don’t you just map()/mapXXX() the kafkaDf with the mapping function that reads the paths? Also, do you really have to read the JSON into an additional DataFrame? Thanks, Boris From: Amit Joshi Sent: Monday, 18 January 2021 15:04 To: spark-user Subject: [Spark Structured
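
One possible reading of the map()-style suggestion, sketched with a hypothetical UDF that fetches each S3 path arriving from Kafka directly via boto3, so no extra DataFrame read is needed; the "bucket/key" path format, broker and topic are assumptions:

import boto3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=StringType())
def fetch_object(path: str) -> str:
    # executor-side: plain boto3, no spark session needed; creating a client
    # per call is kept only for brevity of the sketch
    bucket, _, key = path.partition("/")
    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # assumption
            .option("subscribe", "file-paths")                 # assumption
            .load()
            .selectExpr("CAST(value AS STRING) AS path"))

with_content = kafka_df.withColumn("content", fetch_object(col("path")))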