Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually the job never reaches that point; it fails during the shuffle. Storage
memory and executor memory are usually low when it fails.
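For reference, here is a minimal PySpark sketch of the executor-memory and
shuffle-partition tuning Jack suggests in his reply below. The property names
are standard Spark settings, but the values are illustrative assumptions only,
not recommendations from this thread.

```
from pyspark.sql import SparkSession

# Sketch only: tune executor memory and shuffle parallelism before the app
# starts. The specific values here are placeholders, not advice.
spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    .config("spark.executor.memory", "8g")           # more heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom for shuffle buffers
    .config("spark.sql.shuffle.partitions", "400")   # more, smaller shuffle tasks
    .getOrCreate()
)
```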
On Fri, Sep 8, 2023 at 16:49 Jack Wells  wrote:

> Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS
> if it runs out of memory on a per-executor basis. This could happen when
> evaluating a cache operation like you have below or during shuffle
> operations in joins, etc. You might try to increase executor memory, tune
> shuffle operations, avoid caching, or reduce the size of your dataframe(s).
>
> Jack
>
> On Sep 8, 2023 at 12:43:07, Nebi Aydin 
> wrote:
>
>>
>> Sure
>> df = spark.read.option("basePath",
>> some_path).parquet(*list_of_s3_file_paths())
>> (
>> df
>> .where(SOME FILTER)
>> .repartition(6)
>> .cache()
>> )
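A self-contained version of the snippet above, with placeholder bucket, prefix,
and filter column that I introduced for illustration (they are not values from
the thread), spelling out what cache() does with data that does not fit in memory:

```
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Hypothetical path standing in for the elided basePath / S3 file list above.
df = (
    spark.read.option("basePath", "s3://some-bucket/some-prefix/")
    .parquet("s3://some-bucket/some-prefix/dt=2023-09-08/")
)

# DataFrame.cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK):
# partitions that do not fit in executor memory are written to executor-local
# disk (spark.local.dir / the YARN local dirs), not to HDFS paths you manage.
cached = (
    df.where("some_column IS NOT NULL")   # placeholder for SOME FILTER
      .repartition(6)
      .persist(StorageLevel.MEMORY_AND_DISK)
)

# cache()/persist() is lazy; nothing is materialized until an action runs.
cached.count()
```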
>>
>> On Fri, Sep 8, 2023 at 14:56 Jack Wells  wrote:
>>
>>> Hi Nebi, can you share the code you’re using to read and write from S3?
>>>
>>> On Sep 8, 2023 at 10:59:59, Nebi Aydin 
>>> wrote:
>>>
Hi all,
I am using Spark on EMR to process data. Basically, I read data from AWS
S3, do the transformations, and after the transformations I write the data
back to S3.

Recently we have found that HDFS (/mnt/hdfs) utilization is getting too
high.

 I disabled `yarn.log-aggregation-enable` by setting it to False.
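For readers following along, a sketch of how that property is usually turned
off on EMR, via a "yarn-site" configuration classification; everything beyond
the property itself is an assumption, not taken from this thread.

```
# Sketch only: the EMR configuration classification that disables YARN log
# aggregation, e.g. passed in the Configurations list when creating the cluster.
yarn_log_aggregation_off = {
    "Classification": "yarn-site",
    "Properties": {"yarn.log-aggregation-enable": "false"},
}
```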

I am not writing any data to HDFS (/mnt/hdfs), yet Spark appears to be
creating blocks and writing data into it. We are doing all the operations
in memory.

Is there any specific operation that writes data to the DataNode (HDFS)?

Here are the HDFS dirs created:

 ```

15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G   /mnt/hdfs/current
129G   /mnt/hdfs

 ```
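The BP-* directory is the DataNode's block pool, so it only shows raw block
usage; to see which HDFS paths actually own the data (commonly Spark event
logs, aggregated YARN logs, or staging dirs on EMR), something like the sketch
below can help. It assumes the hdfs CLI is available on the node, as it is on
an EMR cluster.

```
import subprocess

# Print per-directory usage for the HDFS namespace root, so the 129G can be
# traced back to concrete paths rather than anonymous block files.
result = subprocess.run(
    ["hdfs", "dfs", "-du", "-h", "/"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```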


 
