Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS
when it runs out of memory on a per-executor basis. This can happen while
materializing a cache operation like the one you have below, or during
shuffle operations in joins, etc. You might try increasing executor memory,
tuning the shuffle, avoiding the cache, or reducing the size of your
dataframe(s).
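
For what it’s worth, here is a rough sketch of the kinds of knobs I mean —
the memory size, shuffle partition count, and S3 path below are made-up
placeholders, not values taken from your job:

```
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Illustrative settings only -- sizes and counts are placeholders, tune for your cluster.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")          # more heap per executor
    .config("spark.sql.shuffle.partitions", "2000")  # far fewer shuffle partitions than 60000
    .getOrCreate()
)

df = spark.read.parquet("s3://your-bucket/your-prefix/")  # placeholder path

# DataFrame.cache() defaults to MEMORY_AND_DISK, so cached blocks that don't
# fit in memory get written to disk. MEMORY_ONLY skips the disk writes, at the
# cost of recomputing partitions that were evicted.
df = df.repartition(2000).persist(StorageLevel.MEMORY_ONLY)
```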

Jack

On Sep 8, 2023 at 12:43:07, Nebi Aydin <nayd...@binghamton.edu.invalid>
wrote:

>
> Sure
> df = spark.read.option("basePath",
> some_path).parquet(*list_of_s3_file_paths())
> (
>     df
>     .where(SOME FILTER)
>     .repartition(60000)
>     .cache()
> )
>
> On Fri, Sep 8, 2023 at 14:56 Jack Wells <j...@tecton.ai.invalid> wrote:
>
>> Hi Nebi, can you share the code you’re using to read and write from S3?
>>
>> On Sep 8, 2023 at 10:59:59, Nebi Aydin <nayd...@binghamton.edu.invalid>
>> wrote:
>>
>>> Hi all,
>>> I am using Spark on EMR to process data. Basically, I read data from AWS
>>> S3, run the transformations, and after the transformations I write the
>>> data back to S3.
>>>
>>> Recently we have found that HDFS (/mnt/hdfs) utilization is getting too
>>> high.
>>>
>>> I disabled `yarn.log-aggregation-enable` by setting it to False.
>>>
>>> I am not writing any data to HDFS (/mnt/hdfs); however, it looks like
>>> Spark is creating blocks and writing data into it. We are doing all the
>>> operations in memory.
>>>
>>> Is there any specific operation that writes data to the DataNode (HDFS)?
>>>
>>> Here are the HDFS dirs created:
>>>
>>> ```
>>>
>>> 15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>>> 129G   /mnt/hdfs/current
>>> 129G   /mnt/hdfs
>>>
>>> ```
>>>
>>
