Hi all,

Thanks a lot for your suggestions and knowledge sharing. I'd like to let you know that I have completed setting up the standalone cluster, and a couple of data science users have already been using it for the last two weeks. The performance is really good: almost a 10x improvement compared to HPC local mode. They have tested it with some complex data science scripts using Spark, as well as other data science projects. The cluster has been really stable and very performant.

I enabled dynamic allocation and capped the memory and CPU accordingly in spark-defaults.conf and in our Spark framework code. It has been pretty impressive for the last few weeks.
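For anyone searching the archives later, the relevant settings look roughly like the sketch below. The property names are standard Spark configuration, but the values are illustrative rather than our production numbers. Note that on a standalone cluster, dynamic allocation also needs either the external shuffle service or shuffle tracking enabled.

    # spark-defaults.conf -- illustrative values only
    spark.dynamicAllocation.enabled                    true
    # shuffle tracking (Spark 3.0+) avoids running the external shuffle service
    spark.dynamicAllocation.shuffleTracking.enabled    true
    spark.dynamicAllocation.minExecutors               1
    spark.dynamicAllocation.maxExecutors               8
    # per-executor caps
    spark.executor.cores                               4
    spark.executor.memory                              16g
    # total CPU cap per application on a standalone cluster
    spark.cores.max                                    16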
Thank you so much!

Thanks,
Elango

On Tue, 19 Sep 2023 at 6:40 PM, Patrick Tucci <patrick.tu...@gmail.com> wrote:

> Multiple applications can run at once, but you need to either configure
> Spark or your applications to allow that. In standalone mode, each
> application attempts to use all available resources by default. This
> section of the documentation has more details:
>
> https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling
>
> Explicitly setting the resources per application limits the resources to
> the configured values for the lifetime of the application. You can use
> dynamic allocation to allow Spark to scale the resources up and down per
> application based on load, but the configuration is more complex:
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> On Mon, Sep 18, 2023 at 3:53 PM Ilango <elango...@gmail.com> wrote:
>
>> Thanks all for your suggestions. Noted with thanks.
>> Just wanted to share a few more details about the environment:
>> 1. We use NFS for data storage and the data is in Parquet format.
>> 2. All HPC nodes are connected and already work as a cluster for Studio
>> workbench. I can set up passwordless SSH if it does not already exist.
>> 3. We will stick with NFS and standalone for now, and then maybe explore
>> HDFS and YARN.
>>
>> Can you please confirm whether multiple users can run Spark jobs at the
>> same time? If so, I will start working on it and let you know how it goes.
>>
>> Mich, the link to Hadoop is not working. Can you please check and let me
>> know the correct link? I would like to explore the Hadoop option as well.
>>
>> Thanks,
>> Elango
>>
>> On Sat, Sep 16, 2023, 4:20 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> You need to set up SSH without a password; use a key instead. How to
>>> connect without password using SSH (passwordless)
>>> <https://levelup.gitconnected.com/how-to-connect-without-password-using-ssh-passwordless-9b8963c828e8>
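>>>
>>> A minimal sketch of the idea (the user and host names below are
>>> placeholders; repeat the copy step for every other node):
>>>
>>>   ssh-keygen -t ed25519              # generate a key pair; accept the defaults
>>>   ssh-copy-id sparkuser@hpc-node2    # install your public key on the other node
>>>   ssh sparkuser@hpc-node2            # should now log in without a password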
>>>
>>> On Fri, 15 Sep 2023 at 20:55, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can these 4 nodes talk to each other through ssh as trusted hosts (on
>>>> top of the network that Sean already mentioned)? Otherwise you need to
>>>> set it up. You can install a LAN if you have another free port at the
>>>> back of your HPC nodes; they should be able to reach each other directly.
>>>>
>>>> You should be able to set up a Hadoop cluster pretty easily. Check this
>>>> old article of mine for Hadoop set-up:
>>>>
>>>> https://www.linkedin.com/pulse/diy-festive-season-how-install-configure-big-data-so-mich/?trackingId=z7n5tx7tQOGK9tcG9VClkw%3D%3D
>>>>
>>>> Hadoop will provide you with a common storage layer (HDFS) that these
>>>> nodes will be able to share and talk through. YARN is your best bet as
>>>> the resource manager with the reasonably powerful hosts you have.
>>>> However, for now the standalone mode will do. Make sure that the
>>>> metastore you choose is something respectable like a Postgres database
>>>> that can handle multiple concurrent Spark jobs (by default it will use
>>>> the Hive metastore backed by embedded Derby :( ).
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>> view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary
>>>> damages arising from such loss, damage or destruction.
>>>>
>>>> On Fri, 15 Sept 2023 at 07:04, Ilango <elango...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We have 4 HPC nodes and have installed Spark individually on each node.
>>>>>
>>>>> Spark is used in local mode (each driver/executor has 8 cores and
>>>>> 65 GB) from sparklyr/PySpark via RStudio/Posit Workbench. Slurm is used
>>>>> as the scheduler.
>>>>>
>>>>> As this is local mode, we are facing performance issues (as there is
>>>>> only one executor) when dealing with large datasets.
>>>>>
>>>>> Can I convert these 4 nodes into a Spark standalone cluster? We don't
>>>>> have Hadoop, so YARN mode is out of scope.
>>>>>
>>>>> Shall I follow the official documentation for setting up a standalone
>>>>> cluster? Will it work? Is there anything else I need to be aware of?
>>>>> Can you please share your thoughts?
>>>>>
>>>>> Thanks,
>>>>> Elango
>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
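For anyone finding this thread in the archives: per the official documentation linked above, the standalone conversion itself boils down to roughly the following. Hostnames are placeholders, SPARK_HOME must point at the same Spark version on every node, and start-all.sh relies on the passwordless SSH discussed earlier in the thread.

    # on the designated master node ("hpc-node1" is a placeholder)
    $SPARK_HOME/sbin/start-master.sh

    # on each worker node, register with the master
    $SPARK_HOME/sbin/start-worker.sh spark://hpc-node1:7077

    # or: list worker hostnames, one per line, in $SPARK_HOME/conf/workers
    # and launch the whole cluster from the master node
    $SPARK_HOME/sbin/start-all.sh

    # applications then submit against the cluster, e.g. (app.py is a placeholder)
    spark-submit --master spark://hpc-node1:7077 app.py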