[Spark Core]: How do RPC threads influence shuffle?

2023-09-15 Thread Nebi Aydin
Hello all, I know that these parameters exist for shuffle tuning: *spark.shuffle.io.serverThreads*, *spark.shuffle.io.clientThreads*, *spark.shuffle.io.threads*. But we also have *spark.rpc.io.serverThreads*, *spark.rpc.io.clientThreads*, *spark.rpc.io.threads*. So specifically talking about *Shuffling,
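For readers who want to experiment with the two groups of settings named in the question, here is a minimal sketch of passing them to a job. The property names come from the message above; the thread counts are placeholder values chosen for illustration, not tuning advice, and the helper `to_submit_args` is hypothetical. The sketch only builds `spark-submit` arguments and says nothing about how the RPC settings affect shuffle.

```python
# Thread-pool settings named in the question, grouped by transport module.
# The values are placeholders for illustration, not tuning recommendations.
SHUFFLE_IO = {
    "spark.shuffle.io.serverThreads": 8,
    "spark.shuffle.io.clientThreads": 8,
}
RPC_IO = {
    "spark.rpc.io.serverThreads": 4,
    "spark.rpc.io.clientThreads": 4,
}


def to_submit_args(*groups):
    """Flatten config dicts into spark-submit `--conf key=value` arguments."""
    args = []
    for group in groups:
        # Sort keys so the generated command line is deterministic.
        for key, value in sorted(group.items()):
            args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the extra arguments to append to a spark-submit command.
    print(" ".join(to_submit_args(SHUFFLE_IO, RPC_IO)))
```

The same key/value pairs could instead be applied in application code with `SparkConf().set(key, value)` before the context is created.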

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
You need to set up SSH without a password; use a key instead. How to connect without password using SSH (passwordless) On Fri, 15 Sep 2023 at 20:55, Mich Talebzadeh <

Re: Filter out 20% of rows

2023-09-15 Thread Bjørn Jørgensen
Something like this?
# Standard library imports
import json
import multiprocessing
import os
import re
import sys
import random
# Third-party imports
import numpy as np
import pandas as pd
import pyarrow
# Pyspark imports
from pyspark import SparkConf, SparkContext
from pyspark.sql import

Re: Spark stand-alone mode

2023-09-15 Thread Mich Talebzadeh
Hi, Can these 4 nodes talk to each other through ssh as trusted hosts (on top of the network that Sean already mentioned)? Otherwise you need to set it up. You can install a LAN if you have another free port at the back of your HPC nodes. They should You ought to try to set up a Hadoop cluster

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes.
On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote:
> Hi all,
>
> We have 4 HPC nodes and installed spark individually in all nodes.
>
> Spark is

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start-all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start
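The approach described here (start the master, then reach each worker over SSH) might look roughly like the sketch below. This is not the poster's script. The install path, host names and the helper `build_commands` are hypothetical; it assumes key-based SSH is already configured and that the installation ships `sbin/start-master.sh` and `sbin/start-worker.sh` (older releases name the latter `start-slave.sh`). It only prints the commands, as a dry run.

```python
import shlex

# Hypothetical values for illustration; adjust to the actual cluster.
SPARK_HOME = "/opt/spark"
MASTER_HOST = "node1"
WORKER_HOSTS = ["node2", "node3", "node4"]


def build_commands(spark_home, master_host, worker_hosts, port=7077):
    """Return the commands to run on the master node, in order.

    The first command starts the master locally; each following command
    uses SSH to start one worker and point it at the master URL.
    """
    master_url = f"spark://{master_host}:{port}"
    commands = [[f"{spark_home}/sbin/start-master.sh"]]
    for host in worker_hosts:
        # The remote command is passed to ssh as a single string.
        remote = f"{spark_home}/sbin/start-worker.sh {shlex.quote(master_url)}"
        commands.append(["ssh", host, remote])
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in build_commands(SPARK_HOME, MASTER_HOST, WORKER_HOSTS):
        print(shlex.join(cmd))
```

To actually launch the cluster, each command list could be handed to `subprocess.run(cmd, check=True)` instead of being printed.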