Sure
df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths())
(
    df
    .where(some_filter)  # placeholder for the actual filter expression
    .repartition(60000)
    .cache()
)
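
The write back to S3 isn't shown above; a minimal sketch of that step, assuming a Parquet write to a hypothetical s3_output_path (the actual output code may differ):

```
# Hypothetical write step back to S3; the output path and Parquet format
# are assumptions, not the actual job code.
df.write.mode("overwrite").parquet(s3_output_path)
```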

On Fri, Sep 8, 2023 at 14:56 Jack Wells <j...@tecton.ai.invalid> wrote:

> Hi Nebi, can you share the code you’re using to read and write from S3?
>
> On Sep 8, 2023 at 10:59:59, Nebi Aydin <nayd...@binghamton.edu.invalid>
> wrote:
>
>> Hi all,
>> I am using Spark on EMR to process data. Basically, I read data from AWS
>> S3, do the transformations, and after the transformations I write the data
>> back to S3.
>>
>> Recently we found that HDFS (/mnt/hdfs) utilization is getting too high.
>>
>> I disabled `yarn.log-aggregation-enable` by setting it to False.
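>>
>> For reference, a minimal sketch of how that property is typically set on EMR
>> (via a yarn-site configuration classification); the exact mechanism used here
>> is an assumption:
>>
>> ```
>> # Hypothetical EMR configuration classification for yarn-site; this is the
>> # standard EMR way to set yarn.log-aggregation-enable at cluster creation.
>> yarn_site_config = [
>>     {
>>         "Classification": "yarn-site",
>>         "Properties": {"yarn.log-aggregation-enable": "false"},
>>     }
>> ]
>> ```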
>>
>> I am not writing any data to HDFS (/mnt/hdfs); however, it seems that Spark
>> is creating blocks and writing data into it. We are doing all the operations
>> in memory.
>>
>> Is there any specific operation that writes data to the DataNode (HDFS)?
>>
>> Here are the HDFS dirs that were created:
>>
>> ```
>>
>> 15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>> 129G   /mnt/hdfs/current
>> 129G   /mnt/hdfs
>>
>> ```
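>>
>> The listing above is local-disk usage of the DataNode block directories. A
>> minimal sketch for seeing which HDFS paths actually account for that space,
>> assuming it is run on the EMR master node:
>>
>> ```
>> # Shell out to the HDFS CLI and print human-readable usage per HDFS path.
>> import subprocess
>>
>> result = subprocess.run(
>>     ["hdfs", "dfs", "-du", "-h", "/"],
>>     capture_output=True,
>>     text=True,
>>     check=True,
>> )
>> print(result.stdout)
>> ```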
>>
>>
>>
>
