This is a typical file sharing problem in Spark. Just setting up HDFS won't solve the problem unless you make your local machine part of the cluster. The Spark server doesn't share files with your local machine unless the drives are mounted on both sides. The best/easiest way to share the data between your local machine and the Spark server is to use NFS (as the Spark manual suggests). You can use a common NFS server and mount the /opt/data drive on both the local and the server machine, or run NFS on either machine and mount /opt/data on the other. Either way, you have to ensure that /opt/data on both the local and the server machine points to the same physical drive. Also, don't forget to relax the read/write permissions on the drive for all users, or map the user IDs between the two machines.
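For example, a rough sketch (hostnames and export options here are only illustrative, adjust for your environment):

# on the Spark server: export /opt/data, e.g. add this line to /etc/exports
/opt/data    my-laptop(rw,sync,no_subtree_check)
# then re-export:
sudo exportfs -ra

# on the laptop: create the mount point and mount the export
sudo mkdir -p /opt/data
sudo mount -t nfs spark-server:/opt/data /opt/data
# on a Mac the mount usually also needs -o resvport:
# sudo mount -t nfs -o resvport spark-server:/opt/data /opt/data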
Using FUSE may be an option on a Mac, but NFS is the standard solution for this type of problem (macOS supports NFS as well).
-- ND
On 11/25/20 10:34 AM, Ryan Victory wrote:
A key part of what I'm trying to do involves NOT having to bring the
data "through" the driver in order to get the cluster to work on it
(which would involve a network hop from server to laptop and another
from laptop to server). I'd rather have the data stay on the server
and the driver stay on my laptop if possible, but I'm guessing the
Spark APIs/topology weren't designed that way.
What I was hoping for was some way to be able to say something like

val df = spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")

to convince Spark not to move the data. I'd imagine that if I used HDFS, data locality would kick in anyway and prevent the network shuffles between the driver and the cluster, but even then I wonder (based on what you guys are saying) if I'm wrong.
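With HDFS I'd picture something like this (hostname, port and path are just placeholders):

val df = spark.sql("SELECT * FROM parquet.`hdfs://spark-server:9000/data/transactions.parquet`")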
Perhaps I'll just have to modify the workflow to move the JAR to the
server and execute it from there. This isn't ideal but it's better
than nothing.
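In other words, copy the JAR to the server and run something like this there (class and JAR names are placeholders):

spark-submit \
  --master spark://localhost:7077 \
  --class com.example.TransactionsJob \
  transactions-job.jar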
-Ryan
On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho
<chrisbcouti...@gmail.com> wrote:
I'm also curious if this is possible, so while I can't offer a solution, maybe you could try the following.
The driver and executor nodes need to have access to the same (distributed) file system, so you could try to mount that file system locally on your laptop and then submit jobs and/or use the spark-shell while connected to it.
A quick Google search led me to this article, where someone shows how to mount HDFS locally. It appears that Cloudera supports some kind of FUSE-based library, which may be useful for your use case.
https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/
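I haven't tried it myself, but the gist seems to be mounting HDFS with a command roughly along these lines (namenode host, port and mount point are placeholders):

hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs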
On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
Hello!
I have been tearing my hair out trying to solve this problem.
Here is my setup:
1. I have Spark running on a server in standalone mode with data
on the filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed
by MariaDB) on the same server
3. I have a laptop where I am developing my Spark jobs (Scala)
I have configured Spark to use the metastore and set the warehouse directory to be in /opt/data/warehouse/.
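The SparkSession in the job is built roughly like this (hostnames and ports are placeholders for my actual setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TransactionsJob")
  .master("spark://spark-server:7077")
  .config("spark.sql.warehouse.dir", "/opt/data/warehouse")
  .config("hive.metastore.uris", "thrift://spark-server:9083")
  .enableHiveSupport()
  .getOrCreate()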
What I am trying to accomplish are a couple of things:
1. I am trying to submit Spark jobs (via JARs) using
spark-submit, but have the driver run on my local machine (my
laptop). I want the jobs to use the data ON THE SERVER and not
try to reference it from my local machine. If I do something like
this:
val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")
I get an error that the path doesn't exist (because it's trying to find it on my laptop). If I run the same thing in the spark-shell on the Spark server itself, there isn't an issue, because the driver has access to the data. If I submit the job with --deploy-mode cluster, then it works too, because the driver is on the cluster. But I don't want that; I want to get the results on my laptop.
How can I force Spark to read the data from the cluster's
filesystem and not the driver's?
2. I have set up a Hive Metastore and created a table (in the spark-shell on the Spark server itself). The data in the warehouse is on the local filesystem. When I create a Spark application JAR and try to run it from my laptop, I get the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself.
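For reference, the table was created roughly like this on the server (table name is just an example), and the JAR on my laptop only does the read:

// in the spark-shell on the server
spark.sql("CREATE TABLE transactions USING parquet AS SELECT * FROM parquet.`/opt/data/transactions.parquet`")

// in the application JAR, run from my laptop
val df = spark.table("transactions")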
Am I crazy? Perhaps this isn't a supported way to use Spark? Any
help or insights are much appreciated!
-Ryan Victory