Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
Hi, sorry for the duplicates. First-time user :)
I keep getting FetchFailedException with port 7337 closed, which is the
external shuffle service port.
I was trying to tune these parameters.
I have around 1000 executors and 5000 cores.
I tried setting spark.shuffle.io.serverThreads to 2000. Should I also set
spark.shuffle.io.clientThreads
to 2000?
Do shuffle client threads allow one executor to fetch from multiple nodes'
shuffle services?

Thanks
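
For reference, a minimal spark-defaults.conf sketch of the knobs under
discussion. The values are illustrative assumptions for a cluster of this
size, not tested recommendations, and (if I understand correctly) the
serverThreads setting has to be picked up by the process running the
external shuffle service (e.g. the YARN NodeManager aux service), not by
the executors:

```
# Illustrative values only (assumptions, not recommendations):
spark.shuffle.io.serverThreads   2000   # server side; for the external shuffle
                                        # service this likely must be set where
                                        # that service runs
spark.shuffle.io.clientThreads   128    # per-executor fetch threads
spark.shuffle.io.backLog         8192   # deeper accept queue; sometimes helps
                                        # with closed/refused-port fetch failures
```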
On Fri, Aug 18, 2023 at 17:42 Mich Talebzadeh 
wrote:

> Hi,
>
> These two threads you sent seem to be duplicates of each other.
>
> Anyhow, I trust that you are familiar with the concept of shuffle in Spark.
> Spark shuffle is an expensive operation since it involves the following:
>
>    - Disk I/O
>    - Data serialization and deserialization
>    - Network I/O
>
> Basically, these are based on the map/reduce concept in Spark, and the
> parameters you posted relate to various aspects of threading and
> concurrency.
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 20:39, Nebi Aydin 
> wrote:
>
>>
>> I want to learn the differences among the thread configurations below.
>>
>> spark.shuffle.io.serverThreads
>> spark.shuffle.io.clientThreads
>> spark.shuffle.io.threads
>> spark.rpc.io.serverThreads
>> spark.rpc.io.clientThreads
>> spark.rpc.io.threads
>>
>> Thanks.
>>
>


Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi,

These two threads you sent seem to be duplicates of each other.

Anyhow, I trust that you are familiar with the concept of shuffle in Spark.
Spark shuffle is an expensive operation since it involves the following:

   - Disk I/O
   - Data serialization and deserialization
   - Network I/O

Basically, these are based on the map/reduce concept in Spark, and the
parameters you posted relate to various aspects of threading and
concurrency.
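
To make the server/client/shared relationship concrete, here is a small
Python sketch of how I understand these settings resolve. The fallback
order and the cap of 8 default Netty threads are assumptions based on my
reading of Spark's source (NettyUtils/TransportConf), not documented
guarantees:

```python
# Sketch only -- not Spark's actual code. Assumed behaviour: the specific
# serverThreads/clientThreads settings fall back to the shared
# spark.<module>.io.threads value, and an unset value (0) falls back to
# min(usable cores, 8), since Spark caps default Netty threads at 8.

MAX_DEFAULT_NETTY_THREADS = 8  # assumed cap from Spark's NettyUtils

def resolve_threads(conf: dict, module: str, side: str, usable_cores: int) -> int:
    """Resolve e.g. spark.shuffle.io.serverThreads through its assumed fallbacks."""
    specific = conf.get(f"spark.{module}.io.{side}Threads", 0)
    shared = conf.get(f"spark.{module}.io.threads", 0)
    value = specific if specific > 0 else shared
    return value if value > 0 else min(usable_cores, MAX_DEFAULT_NETTY_THREADS)

conf = {"spark.shuffle.io.serverThreads": 2000}
print(resolve_threads(conf, "shuffle", "server", 16))  # 2000: explicit setting wins
print(resolve_threads(conf, "shuffle", "client", 16))  # 8: capped default
```

Under this reading, leaving clientThreads unset would not mirror a
2000-thread server pool; also note that an executor fetches from many
shuffle services regardless, with the client threads only controlling how
much of that work runs in parallel.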

HTH


Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh







On Fri, 18 Aug 2023 at 20:39, Nebi Aydin 
wrote:

>
> I want to learn the differences among the thread configurations below.
>
> spark.shuffle.io.serverThreads
> spark.shuffle.io.clientThreads
> spark.shuffle.io.threads
> spark.rpc.io.serverThreads
> spark.rpc.io.clientThreads
> spark.rpc.io.threads
>
> Thanks.
>


[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn the differences among the thread configurations below.

spark.shuffle.io.serverThreads
spark.shuffle.io.clientThreads
spark.shuffle.io.threads
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads

Thanks.


[no subject]

2023-08-18 Thread Dipayan Dev
Unsubscribe --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop*
M.Tech (AI), IISc, Bangalore


Re: read dataset from only one node in YARN cluster

2023-08-18 Thread Mich Talebzadeh
Hi,

Where do you see this? In the Spark UI?

So the data is most probably skewed, as one node gets all the data and the
others nothing, as I understand it?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh







On Fri, 18 Aug 2023 at 17:17, marc nicole  wrote:

> Hi,
>
> Spark 3.2, Hadoop 3.2, using YARN cluster mode: if one wants to read a
> dataset that exists on only one node of the cluster and not on the others,
> how does one tell Spark that?
>
> I expect this to work through DataFrameReader, using a path like
> *IP:port/pathOnLocalNode*
>
> PS: loading the dataset in HDFS is not an option.
>
> Thanks
>


read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi,

Spark 3.2, Hadoop 3.2, using YARN cluster mode: if one wants to read a
dataset that exists on only one node of the cluster and not on the others,
how does one tell Spark that?

I expect this to work through DataFrameReader, using a path like
*IP:port/pathOnLocalNode*

PS: loading the dataset in HDFS is not an option.

Thanks
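
As a side note, a small sketch of the URI form a DataFrameReader would need
for a local path; the caveat in the docstring, rather than the helper
itself (which is purely hypothetical), is the real answer to the question:

```python
from pathlib import PurePosixPath

def local_path_to_spark_uri(path: str) -> str:
    """Build the file:// URI form a DataFrameReader needs for a local path.

    The crux (an assumption worth verifying against the Spark docs): in YARN
    cluster mode every executor resolves this URI against its *own* local
    filesystem, so the file must exist at the same path on every node; there
    is no IP:port/path syntax for plain local files. A shared mount (e.g.
    NFS) or copying the file to all nodes is the usual workaround when HDFS
    is not an option.
    """
    p = PurePosixPath(path)
    if not p.is_absolute():
        raise ValueError("Spark local-file URIs need an absolute path")
    return f"file://{p}"

print(local_path_to_spark_uri("/data/local/dataset.csv"))  # file:///data/local/dataset.csv
```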


RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
Hi @Bjørn Jørgensen,

Thank you for your quick response.

Based on the PR you shared, we are doing our own analysis. For the few jars
for which you requested the CVE IDs, I have updated them in the attached
document. Kindly verify it from your side and get back to us.

Thanks,
Sankavi

From: Bjørn Jørgensen 
Sent: Monday, August 14, 2023 6:11 PM
To: Sankavi Nagalingam 
Cc: user@spark.apache.org; Vijaya Kumar Mathupaiyan 
Subject: [EXT MSG] Re: Spark Vulnerabilities

EXTERNAL source. Be CAREFUL with links / attachments

I have added links to the GitHub PRs, or a comment for those that I have not
seen before.

Apache Spark has very many dependencies; some can easily be upgraded, while
others are very hard to fix.

Please feel free to open a PR if you want to help.

man. 14. aug. 2023 kl. 14:06 skrev Sankavi Nagalingam 
mailto:sankavi.nagalin...@temenos.com.invalid>>:
Hi Team,

We can see that there are many dependency vulnerabilities present in the latest
spark-core:3.4.1.jar. PFA.
Could you please let us know when the fix version will be available for
users.

Thanks,
Sankavi


The information in this e-mail and any attachments is confidential and may be 
legally privileged. It is intended solely for the addressee or addressees. Any 
use or disclosure of the contents of this e-mail/attachments by a not intended 
recipient is unauthorized and may be unlawful. If you have received this e-mail 
in error please notify the sender. Please note that any views or opinions 
presented in this e-mail are solely those of the author and do not necessarily 
represent those of TEMENOS. We recommend that you check this e-mail and any 
attachments against viruses. TEMENOS accepts no liability for any damage caused 
by any malicious code or virus transmitted by this e-mail.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org


--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297



Spark-3.4.1-Vulnerablities-spark team.xlsx
Description: Spark-3.4.1-Vulnerablities-spark team.xlsx

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and
4 GB, so I suggest that you leave it as it is.

I have not used standalone mode since Spark 2.4.3, so I may be missing a
fair bit of context here. I am sure there are others like you who are
still using it!
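
For anyone following along, the cutoff being discussed is presumably the
broadcast-join threshold; a spark-defaults.conf fragment with what I believe
is its stock default (worth double-checking against your build's docs):

```
# Believed default: 10 MB (10485760 bytes); -1 disables automatic broadcast joins.
spark.sql.autoBroadcastJoinThreshold   10485760
```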

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh







On Thu, 17 Aug 2023 at 23:33, Patrick Tucci  wrote:

> No, the driver memory was not set explicitly. So it was likely the default
> value, which appears to be 1GB.
>
> On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh 
> wrote:
>
>> One question, what was the driver memory before setting it to 4G? Did you
>> have it set at all before?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>>
>>
>>
>>
>> On Thu, 17 Aug 2023 at 21:01, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Here are my config values from spark-defaults.conf:
>>>
>>> spark.eventLog.enabled true
>>> spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
>>> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
>>> spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
>>> spark.history.fs.update.interval 10s
>>> spark.history.ui.port 18080
>>> spark.sql.warehouse.dir hdfs://10.0.50.1:8020/user/spark/warehouse
>>> spark.executor.cores 4
>>> spark.executor.memory 16000M
>>> spark.sql.legacy.createHiveTableByDefault false
>>> spark.driver.host 10.0.50.1
>>> spark.scheduler.mode FAIR
>>> spark.driver.memory 4g #added 2023-08-17
>>>
>>> The only application that runs on the cluster is the Spark Thrift
>>> server, which I launch like so:
>>>
>>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>>>
>>> The cluster runs in standalone mode and does not use Yarn for resource
>>> management. As a result, the Spark Thrift server acquires all available
>>> cluster resources when it starts. This is okay; as of right now, I am the
>>> only user of the cluster. If I add more users, they will also be SQL users,
>>> submitting queries through the Thrift server.
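
If more users do share the cluster later, a hedged sketch of the
standalone-mode caps that would keep a single application (here, the Thrift
server) from acquiring every core; the numbers are placeholders, not
recommendations:

```
# Hypothetical standalone-mode caps (placeholder values):
spark.cores.max        32      # total cores this application may acquire
spark.executor.cores   4
spark.executor.memory  16000M
```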
>>>
>>> Let me know if you have any other questions or thoughts.
>>>
>>> Thanks,
>>>
>>> Patrick
>>>
>>> On Thu, Aug 17, 2023 at 3:09 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hello Patrick,

 As a matter of interest, what parameters and their respective values do
 you use in spark-submit? I assume it is running in YARN mode.

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh







 On Thu, 17 Aug 2023 at 19:36, Patrick Tucci 
 wrote:

> Hi Mich,
>
> Yes, that's the sequence of events. I think the big breakthrough is
> that (for now at least) Spark is throwing errors instead of the queries
> hanging, which is a big step forward. I can at least troubleshoot issues
> if I know what they are.
>
> When I reflect on the issues I faced and the solutions, my issue may
> have been driver memory all along. I just couldn't determine that was the
> issue because I never saw any errors. In one case, converting a LEFT JOIN
> to an inner JOIN caused the query to run. In another case, replacing a 
> text
> field with an int ID and JOINing on the ID column worked. Per your advice,
> changing file formats from ORC to Parquet solved one issue. These
> interventions could have changed the