unsubscribe

2024-05-01 Thread Nebi Aydin
unsubscribe


About shuffle partition size

2023-12-20 Thread Nebi Aydin
Hi all,
What happens when the number of unique join keys is less than the number
of shuffle partitions?
Are we going to end up with lots of empty partitions?
If yes, is there any point in having more shuffle partitions than unique
join keys?
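A rough way to see this without a cluster: the plain Python below stands in for Spark's default hash partitioner (a row goes to partition `hash(key) % numPartitions`); it is a sketch, not Spark's actual implementation. With fewer distinct keys than partitions, most partitions are necessarily empty.

```python
# Toy model of hash partitioning: a row goes to partition hash(key) % n.
# With 10 distinct keys and 200 partitions, at least 190 partitions are empty.
num_partitions = 200
join_keys = [f"key-{i}" for i in range(10)]  # 10 distinct join keys

used_partitions = {hash(k) % num_partitions for k in join_keys}
empty_partitions = num_partitions - len(used_partitions)

print(f"{empty_partitions} of {num_partitions} partitions are empty")
```

The extra partitions hold no data but still cost task-scheduling overhead, which is the usual argument against setting shuffle partitions far above key cardinality.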


Thread dump only shows 10 shuffle clients

2023-09-28 Thread Nebi Aydin
Hi all,
I set spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads
to 800.
But when I click Thread dump in the Spark UI for an executor, I only see
10 shuffle client threads for that executor.
Is that normal, or am I missing something?


Files io threads vs shuffle io threads

2023-09-27 Thread Nebi Aydin
Hi all,
Can someone explain the difference between
files io threads and shuffle io threads? I couldn't find any explanation.
I'm specifically asking about these:
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads

spark.files.io.serverThreads
spark.files.io.clientThreads
spark.files.io.threads
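For context, all of these are set the same way at submit time; a minimal sketch (the values are illustrative assumptions, not recommendations, and `your_app.py` is a placeholder):

```shell
# Illustrative values only: sensible thread counts depend on cores per node
# and on workload; these keys configure Spark's Netty transport pools.
spark-submit \
  --conf spark.rpc.io.serverThreads=16 \
  --conf spark.rpc.io.clientThreads=16 \
  --conf spark.files.io.serverThreads=16 \
  --conf spark.files.io.clientThreads=16 \
  your_app.py
```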


About Peak Jvm Memory Onheap

2023-09-17 Thread Nebi Aydin
Hi all,
I couldn't find any useful doc that explains the `Peak JVM Memory Onheap`
field on the Spark UI.
Most of the time my applications have very low On heap storage memory and
Peak execution memory on heap, but a very big `Peak JVM Memory Onheap` on
the Spark UI.
Can someone please explain the difference between these metrics?


[Spark Core]: How does rpc threads influence shuffle?

2023-09-15 Thread Nebi Aydin
Hello all,
I know that these parameters exist for shuffle tuning:

spark.shuffle.io.serverThreads
spark.shuffle.io.clientThreads
spark.shuffle.io.threads

But we also have:

spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads

So, specifically for shuffling, what is the influence of the RPC-related
thread configurations? What is the relationship between RPC threads and
shuffle threads?


Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually the job never reaches that point; it fails during shuffle. And
storage memory and executor memory are usually low when it fails.
On Fri, Sep 8, 2023 at 16:49 Jack Wells  wrote:

> Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS
> if it runs out of memory on a per-executor basis. This could happen when
> evaluating a cache operation like you have below or during shuffle
> operations in joins, etc. You might try to increase executor memory, tune
> shuffle operations, avoid caching, or reduce the size of your dataframe(s).
>
> Jack
>
> On Sep 8, 2023 at 12:43:07, Nebi Aydin 
> wrote:
>
>>
>> Sure
>> df = spark.read.option("basePath",
>> some_path).parquet(*list_of_s3_file_paths())
>> (
>> df
>> .where(SOME FILTER)
>> .repartition(6)
>> .cache()
>> )
>>
>> On Fri, Sep 8, 2023 at 14:56 Jack Wells  wrote:
>>
>>> Hi Nebi, can you share the code you’re using to read and write from S3?
>>>
>>> On Sep 8, 2023 at 10:59:59, Nebi Aydin 
>>> wrote:
>>>
>>>> Hi all,
>>>> I am using Spark on EMR to process data. Basically I read data from AWS
>>>> S3, do the transformations, and post transformation I load/write the
>>>> data back to S3.
>>>>
>>>> Recently we have found that HDFS (/mnt/hdfs) utilization is going too
>>>> high.
>>>>
>>>> I disabled `yarn.log-aggregation-enable` by setting it to false.
>>>>
>>>> I am not writing any data to HDFS (/mnt/hdfs); however, it seems that
>>>> Spark is creating blocks and writing data into it. We are doing all the
>>>> operations in memory.
>>>>
>>>> Is there any specific operation writing data to the datanode (HDFS)?
>>>>
>>>> Here is the hdfs dirs created.
>>>>
>>>> ```
>>>>
>>>> 15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>>>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>>>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>>>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>>>> 129G  /mnt/hdfs/current
>>>> 129G  /mnt/hdfs
>>>>
>>>> ```
>>>>
>>>>
>>>>
>>>
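Jack's suggestions (more executor memory, tuned shuffle operations, less caching) would translate into something like the sketch below. The numbers are illustrative assumptions only, and the right values depend on the cluster and data size:

```shell
# Hedged sketch, not a recommendation: give executors more headroom before
# they spill to disk, and right-size shuffle partitions instead of caching.
spark-submit \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.sql.shuffle.partitions=2000 \
  your_app.py
```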


Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Sure
df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths())
(
    df
    .where(SOME FILTER)
    .repartition(6)
    .cache()
)

On Fri, Sep 8, 2023 at 14:56 Jack Wells  wrote:

> Hi Nebi, can you share the code you’re using to read and write from S3?
>
> On Sep 8, 2023 at 10:59:59, Nebi Aydin 
> wrote:
>
>> Hi all,
>> I am using Spark on EMR to process data. Basically I read data from AWS
>> S3, do the transformations, and post transformation I load/write the
>> data back to S3.
>>
>> Recently we have found that HDFS (/mnt/hdfs) utilization is going too high.
>>
>> I disabled `yarn.log-aggregation-enable` by setting it to false.
>>
>> I am not writing any data to HDFS (/mnt/hdfs); however, it seems that
>> Spark is creating blocks and writing data into it. We are doing all the
>> operations in memory.
>>
>> Is there any specific operation writing data to the datanode (HDFS)?
>>
>> Here is the hdfs dirs created.
>>
>> ```
>>
>> 15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
>> 129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
>> 129G  /mnt/hdfs/current
>> 129G  /mnt/hdfs
>>
>> ```
>>
>>
>>
>


About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Hi all,
I am using Spark on EMR to process data. Basically I read data from AWS S3,
do the transformations, and post transformation I load/write the data back
to S3.

Recently we have found that HDFS (/mnt/hdfs) utilization is going too high.

I disabled `yarn.log-aggregation-enable` by setting it to false.

I am not writing any data to HDFS (/mnt/hdfs); however, it seems that Spark
is creating blocks and writing data into it. We are doing all the operations
in memory.

Is there any specific operation writing data to the datanode (HDFS)?

Here are the HDFS dirs created:

```

15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G  /mnt/hdfs/current
129G  /mnt/hdfs

```
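To narrow down which data those BP- block directories hold, a few standard HDFS shell commands can help (a sketch, assuming shell access on the EMR master node):

```shell
# Map HDFS paths to usage: whichever path is large here owns the datanode blocks.
hdfs dfs -du -h /
# Spark event logs and YARN application data commonly live under paths like
# these on EMR (suppress errors for paths that don't exist on this cluster):
hdfs dfs -du -h /var/log 2>/dev/null
hdfs dfs -du -h /user 2>/dev/null
```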





Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 23:30, Nebi Aydin  wrote:
>
>>
>> Hi, sorry for the duplicates. First-time user :)
>> I keep getting FetchFailedException: port 7337 closed, which is the
>> external shuffle service port.
>> I was trying to tune these parameters.
>> I have around 1000 executors and 5000 cores.
>> I tried to set spark.shuffle.io.serverThreads to 2000. Should I also set
>> spark.shuffle.io.clientThreads to 2000?
>> Do shuffle client threads allow one executor to fetch from multiple
>> nodes' shuffle services?
>>
>> Thanks
>> On Fri, Aug 18, 2023 at 17:42 Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> These two threads that you sent seem to be duplicates of each other?
>>>
>>> Anyhow I trust that you are familiar with the concept of shuffle in
>>> Spark. Spark Shuffle is an expensive operation since it involves the
>>> following
>>>
>>>    - Disk I/O
>>>    - Data serialization and deserialization
>>>    - Network I/O
>>>
>>> Basically these are based on the concept of map/reduce in Spark and
>>> these parameters you posted relate to various aspects of threading and
>>> concurrency.
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Aug 2023 at 20:39, Nebi Aydin 
>>> wrote:
>>>
>>>>
>>>> I want to learn the differences among the thread configurations below.
>>>>
>>>> spark.shuffle.io.serverThreads
>>>> spark.shuffle.io.clientThreads
>>>> spark.shuffle.io.threads
>>>> spark.rpc.io.serverThreads
>>>> spark.rpc.io.clientThreads
>>>> spark.rpc.io.threads
>>>>
>>>> Thanks.
>>>>
>>>
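Mich's three costs can be made concrete with a toy shuffle in plain Python (a sketch standing in for Spark, not its actual code): the map side serializes and buckets records by key, and the reduce side fetches, deserializes, and merges them.

```python
import pickle
from collections import defaultdict

# Toy shuffle: the map side hashes each record's key to a reduce partition
# and serializes it (the disk-I/O and serialization costs); the reduce side
# deserializes and merges its bucket (in real Spark this is the network
# fetch where a FetchFailedException would surface).
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
num_reducers = 2

# Map side: bucket serialized records by hash(key) % num_reducers.
shuffle_files = defaultdict(list)
for key, value in records:
    shuffle_files[hash(key) % num_reducers].append(pickle.dumps((key, value)))

# Reduce side: each reducer deserializes its bucket and aggregates by key.
merged = defaultdict(int)
for bucket in shuffle_files.values():
    for blob in bucket:
        key, value = pickle.loads(blob)
        merged[key] += value

print(dict(sorted(merged.items())))  # {'a': 4, 'b': 2, 'c': 4}
```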


Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
Hi, sorry for the duplicates. First-time user :)
I keep getting FetchFailedException: port 7337 closed, which is the
external shuffle service port.
I was trying to tune these parameters.
I have around 1000 executors and 5000 cores.
I tried to set spark.shuffle.io.serverThreads to 2000. Should I also set
spark.shuffle.io.clientThreads to 2000?
Do shuffle client threads allow one executor to fetch from multiple nodes'
shuffle services?

Thanks
On Fri, Aug 18, 2023 at 17:42 Mich Talebzadeh 
wrote:

> Hi,
>
> These two threads that you sent seem to be duplicates of each other?
>
> Anyhow I trust that you are familiar with the concept of shuffle in Spark.
> Spark Shuffle is an expensive operation since it involves the following
>
>    - Disk I/O
>    - Data serialization and deserialization
>    - Network I/O
>
> Basically these are based on the concept of map/reduce in Spark and these
> parameters you posted relate to various aspects of threading and
> concurrency.
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 20:39, Nebi Aydin 
> wrote:
>
>>
>> I want to learn the differences among the thread configurations below.
>>
>> spark.shuffle.io.serverThreads
>> spark.shuffle.io.clientThreads
>> spark.shuffle.io.threads
>> spark.rpc.io.serverThreads
>> spark.rpc.io.clientThreads
>> spark.rpc.io.threads
>>
>> Thanks.
>>
>


[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn the differences among the thread configurations below.

spark.shuffle.io.serverThreads
spark.shuffle.io.clientThreads
spark.shuffle.io.threads
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads

Thanks.

