Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Sean Owen
NFS is a simple option for this kind of usage, yes. But --files makes N copies of the data; you may not want to do that for large data, or for data that you need to mutate.
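A minimal sketch of the shared-mount route Sean refers to, assuming /opt/data is exported by the server and mounted at the same path on the laptop (host and file names below are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shared-mount-read")
      .getOrCreate()

    // With /opt/data mounted identically on the driver and every executor,
    // a plain file:// URL resolves on all of them without shipping copies.
    val df = spark.read.parquet("file:///opt/data/mytable")
    df.show()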

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
Ah, I almost forgot that there is an even easier solution for your problem, namely to use the --files option in spark-submit. Usage as follows:

  --files FILES   Comma-separated list of files to be placed in the working
                  directory of each executor. File paths of these files in
                  executors can be accessed via SparkFiles.get(fileName).
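A short sketch of that option in use, with hypothetical host, jar, and file names; the file is shipped once to each executor and resolved inside a task via SparkFiles.get:

    // Submit with the file attached (hypothetical names):
    //   spark-submit --master spark://spark-server:7077 \
    //     --files /opt/data/lookup.csv my-app.jar
    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("files-demo").getOrCreate()

    // Each executor receives its own copy of lookup.csv in its working
    // directory; SparkFiles.get resolves that local path inside a task.
    val lineCount = spark.sparkContext.parallelize(1 to 4).map { _ =>
      scala.io.Source.fromFile(SparkFiles.get("lookup.csv")).getLines().length
    }.first()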

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
This is a typical file sharing problem in Spark. Just setting up HDFS won't solve the problem unless you make your local machine part of the cluster. The Spark server doesn't share files with your local machine unless you mount drives between the two. The best/easiest way to share the data
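A sketch of the mounting Artemis describes, using NFS as one concrete option (hypothetical host names and paths; the export line follows standard /etc/exports syntax):

    # On the server: export the data directory (add to /etc/exports)
    /opt/data  laptop-host(ro,sync)

    # On the laptop: mount it at the same path the server uses
    sudo mount -t nfs spark-server:/opt/data /opt/data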

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Jeff Evans
In your situation, I'd try to do one of the following (in decreasing order of personal preference):

1. Restructure things so that you can operate on a local data file, at least for the purpose of developing your driver logic. Don't rely on the Metastore or HDFS until you have to.
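A minimal sketch of suggestion 1, assuming a hypothetical sample.csv sitting next to the project: develop against a local master and a local file, and wire in the Metastore/HDFS only once the driver logic works:

    import org.apache.spark.sql.SparkSession

    // Local master: the driver and executors all run in this one JVM,
    // so a relative path to a local file just works.
    val spark = SparkSession.builder()
      .appName("local-dev")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("sample.csv")
    df.printSchema()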

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
A key part of what I'm trying to do involves NOT having to bring the data "through" the driver in order to get the cluster to work on it (which would involve a network hop from server to laptop and another from laptop to server). I'd rather have the data stay on the server and the driver stay on

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Chris Coutinho
I'm also curious if this is possible, so while I can't offer a solution, maybe you could try the following. The driver and executor nodes need access to the same (distributed) file system, so you could try mounting the file system locally on your laptop, and then try to submit jobs and/or

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos, I'm trying to avoid standing up HDFS just for this use case (single node). -Ryan

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Apostolos N. Papadopoulos
Hi Ryan, since the driver is on your laptop, I guess you need to specify a full URL in order to access a remote file. For example, when I am using Spark over HDFS, I specify the file with a URL like hdfs://blablabla, which contains the host where the namenode answers. I believe that something
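For example (hypothetical host and path; 8020 is HDFS's default namenode RPC port), the driver refers to the file by its full URL rather than a bare local path:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-read").getOrCreate()

    // The hdfs:// scheme routes the lookup through the namenode, so the
    // path resolves the same way from the laptop and from the executors.
    val df = spark.read.parquet("hdfs://spark-server:8020/data/mytable")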

Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Hello! I have been tearing my hair out trying to solve this problem. Here is my setup:

1. I have Spark running on a server in standalone mode with data on the filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by MariaDB) on the same
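For reference, a minimal sketch of the setup being described (host and file names are hypothetical): a driver on the laptop attached to the standalone master, where a server-only file:// path is the sticking point:

    import org.apache.spark.sql.SparkSession

    // Driver runs on the laptop, attached to the standalone master.
    val spark = SparkSession.builder()
      .appName("laptop-driver")
      .master("spark://spark-server:7077")
      .getOrCreate()

    // This fails when /opt/data exists only on the server: with file://,
    // the driver itself tries to list the path (for schema and partition
    // discovery) and cannot find it on the laptop.
    val df = spark.read.parquet("file:///opt/data/mytable")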