Sure

df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths())

(
    df
    .where(SOME FILTER)
    .repartition(60000)
    .cache()
)
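For context, here is a minimal runnable sketch of that pipeline; the bucket, paths, and filter column are placeholders rather than the actual job's values, and the explicit persist() only spells out the memory-and-disk storage level that DataFrame .cache() uses, so partitions that do not fit in executor memory spill to disk:

```
# Hedged sketch only: bucket, paths, and filter are placeholders, not the real job values.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-transform").getOrCreate()

def list_of_s3_file_paths():
    # Placeholder for the real path-listing helper used in the job.
    return [
        "s3://some-bucket/table/dt=2023-09-01/",
        "s3://some-bucket/table/dt=2023-09-02/",
    ]

df = (
    spark.read
    .option("basePath", "s3://some-bucket/table/")   # placeholder base path
    .parquet(*list_of_s3_file_paths())
)

cached = (
    df
    .where(df["some_column"] == "some_value")        # placeholder filter
    .repartition(60000)                              # full shuffle; shuffle files land on executor local disks
    .persist(StorageLevel.MEMORY_AND_DISK)           # roughly what DataFrame.cache() does: spill to disk when memory is short
)
```

Note that spilled cache blocks and shuffle files normally go under the executors' local directories (spark.local.dir / yarn.nodemanager.local-dirs), not the HDFS datanode directories, so whether they account for the /mnt/hdfs growth below is a separate question.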
On Fri, Sep 8, 2023 at 14:56 Jack Wells <j...@tecton.ai.invalid> wrote:

> Hi Nebi, can you share the code you’re using to read and write from S3?
>
> On Sep 8, 2023 at 10:59:59, Nebi Aydin <nayd...@binghamton.edu.invalid> wrote:
>
>> Hi all,
>> I am using Spark on EMR to process data. Basically I read data from AWS
>> S3, do the transformations, and after the transformations I load/write
>> the data back to S3.
>>
>> Recently we have found that HDFS (/mnt/hdfs) utilization is getting too high.
>>
>> I disabled `yarn.log-aggregation-enable` by setting it to false.
>>
>> I am not writing any data to HDFS (/mnt/hdfs), however Spark is creating
>> blocks and writing data into it. We are doing all the operations in memory.
>>
>> Is there any specific operation writing data to the datanode (HDFS)?
>>
>> Here are the HDFS dirs created:
>>
>> ```
>> 15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>> 129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>> 129G   /mnt/hdfs/current
>> 129G   /mnt/hdfs
>> ```
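One hedged way to narrow down what is landing in HDFS is to print the Spark settings that control where event logs, warehouse data, and scratch space are written; the keys below are standard Spark configuration names (EMR typically points spark.eventLog.dir at an hdfs:// path), and the values would be whatever this cluster actually sets. This assumes the same `spark` session as in the snippet above:

```
# Hedged diagnostic sketch: show where this cluster is configured to write
# event logs, warehouse data, and local scratch space. The keys are standard
# Spark settings; no values are asserted for this particular cluster.
for key in (
    "spark.eventLog.enabled",
    "spark.eventLog.dir",        # on EMR this is often an hdfs:// path
    "spark.sql.warehouse.dir",
    "spark.local.dir",
):
    print(key, "=", spark.conf.get(key, "<not set>"))
```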