In your situation, I'd try to do one of the following (in decreasing order
of personal preference):

   1. Restructure things so that you can operate on a local data file, at
   least while you develop your driver logic.  Don't rely on the Metastore or
   HDFS until you have to.  Structure the application logic so that it
   operates on a DataFrame (or Dataset) and doesn't care where the data came
   from, then build that file from a small subset of your real data (see the
   sketch after this list).
   2. Develop the logic using spark-shell running on a cluster node, since
   the environment will be all set up already (which, of course, you already
   mentioned).
   3. Set up remote debugging of the driver, open an SSH tunnel to the
   node, and connect from your local laptop to debug/iterate.  Figure out the
   fastest way to rebuild the jar and scp it up to try again.
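For (1), here's a rough sketch of what I mean (everything in it is a made-up
placeholder: the summarize function, the account_id column, and the sample
path), just to show the shape of "logic that only sees a DataFrame":

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // All of the interesting logic takes a DataFrame and returns a DataFrame.
    // It never knows whether the input came from a local file, HDFS, or the
    // Metastore.
    def summarize(transactions: DataFrame): DataFrame =
      transactions.groupBy("account_id").count()

    // Locally, build the input from a small sample file and iterate quickly.
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sample = spark.read.parquet("./sample/transactions.parquet")
    summarize(sample).show()

    // On the cluster, the same function runs against the real source instead,
    // e.g. spark.table("transactions") or the real parquet path.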


On Wed, Nov 25, 2020 at 9:35 AM Ryan Victory <rvict...@gmail.com> wrote:

> A key part of what I'm trying to do involves NOT having to bring the data
> "through" the driver in order to get the cluster to work on it (which would
> involve a network hop from server to laptop and another from laptop to
> server). I'd rather have the data stay on the server and the driver stay on
> my laptop if possible, but I'm guessing the Spark APIs/topology weren't
> designed that way.
>
> What I was hoping for was some way to be able to say val df =
> spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")
> or similar to convince Spark to not move the data. I'd imagine if I used
> HDFS, data locality would kick in anyways to prevent the network shuffles
> between the driver and the cluster, but even then I wonder (based on what
> you guys are saying) if I'm wrong.
>
> Perhaps I'll just have to modify the workflow to move the JAR to the
> server and execute it from there. This isn't ideal but it's better than
> nothing.
>
> -Ryan
>
> On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho <chrisbcouti...@gmail.com>
> wrote:
>
>> I'm also curious if this is possible, so while I can't offer a solution
>> maybe you could try the following.
>>
>> The driver and executor nodes need access to the same (distributed) file
>> system, so you could try mounting that file system locally on your laptop
>> and then submit jobs and/or use the spark-shell against the same paths.
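>>
>> For example, if you could mount the server's /opt/data at the same path on
>> your laptop (say via NFS or the FUSE approach below; this is purely
>> hypothetical, I haven't tried it), then in theory the exact query you
>> already have would resolve on both the driver and the executors:
>>
>> val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")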
>>
>> A quick Google search turned up this article, where someone shows how to
>> mount HDFS locally. It appears that Cloudera supports some kind of
>> FUSE-based library, which may be useful for your use case.
>>
>> https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/
>>
>> On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
>>
>> Hello!
>>
>> I have been tearing my hair out trying to solve this problem. Here is my
>> setup:
>>
>> 1. I have Spark running on a server in standalone mode with data on the
>> filesystem of the server itself (/opt/data/).
>> 2. I have an instance of a Hive Metastore server running (backed by
>> MariaDB) on the same server.
>> 3. I have a laptop where I am developing my Spark jobs (Scala).
>>
>> I have configured Spark to use the metastore and set the warehouse
>> directory to be in /opt/data/warehouse/. I am trying to accomplish a couple
>> of things:
>>
>> 1. I am trying to submit Spark jobs (via JARs) using spark-submit, but
>> have the driver run on my local machine (my laptop). I want the jobs to use
>> the data ON THE SERVER and not try to reference it from my local machine.
>> If I do something like this:
>>
>> val df = spark.sql("SELECT * FROM
>> parquet.`/opt/data/transactions.parquet`")
>>
>> I get an error that the path doesn't exist (because it's trying to find
>> it on my laptop). If I run the same thing in a spark-shell on the Spark
>> server itself, there isn't an issue because the driver has access to the
>> data. If I submit the job with --deploy-mode cluster then it works too,
>> because the driver is on the cluster. I don't want this; I want to get the
>> results on my laptop.
>>
>> How can I force Spark to read the data from the cluster's filesystem and
>> not the driver's?
>>
>> 2. I have set up a Hive Metastore and created a table (in the spark-shell
>> on the Spark server itself). The data in the warehouse is on the local
>> filesystem. When I create a Spark application JAR and try to run it from my
>> laptop, I get the same problem as #1, namely that it tries to find the
>> warehouse directory on my laptop itself.
>>
>> Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or
>> insights are much appreciated!
>>
>> -Ryan Victory
>>
>>
>>
