Re: [Spark-Submit] Where to store data files while running job in cluster mode?
If you're running in cluster mode, the file needs to be readable from every node, not just the master. Two options:

1) Put it into a distributed filesystem such as HDFS (or fetch it via (s)ftp from a common location).
2) Transfer (scp/sftp) the file to the same path on each worker node before running the Spark job, and then pass that worker-filesystem path as the argument to textFile.

Regards,
Vaquar Khan

On Fri, Sep 29, 2017 at 2:00 PM, JG Perrin wrote:
> On a test system, you can also use something like Owncloud/Nextcloud/Dropbox
> to ensure that the files are synchronized. Would not do it for TB of data ;)
> [earlier quoted messages trimmed]

--
Regards,
Vaquar Khan
+1-224-436-0783
Greater Chicago

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
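A minimal sketch of option 2 above, assuming the file has already been scp'd to the same location on every worker (the path and master URL are the ones from the original post; `file://` is used to make the node-local intent explicit):

```scala
// Sketch of option 2: the file was copied (scp/sftp) to the SAME path on
// every worker node beforehand; textFile is then given that worker-side path.
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("SampleFlightsApp")
      .setMaster("spark://masterIP:7077"))

    // file:// makes explicit that this is a node-local path,
    // which must exist on all workers, not only on the master.
    val lines = sc.textFile("file:///home/username/sampleflightdata")
    println(s"line count: ${lines.count()}")

    sc.stop()
  }
}
```

If the file is missing on even one worker, tasks scheduled there will fail with the same FileNotFoundException seen in the original post.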
RE: [Spark-Submit] Where to store data files while running job in cluster mode?
On a test system, you can also use something like Owncloud/Nextcloud/Dropbox to ensure that the files are synchronized. Would not do it for TB of data ;)

-----Original Message-----
From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Friday, September 29, 2017 5:14 AM
To: Gaurav1809
Cc: user@spark.apache.org
Subject: Re: [Spark-Submit] Where to store data files while running job in cluster mode?

You should use a distributed filesystem such as HDFS. If you want to use the local filesystem then you have to copy each file to each node.

> [original message from Gaurav1809 trimmed]
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
Try Tachyon; it's less fuss.

On Fri, 29 Sep 2017 at 8:32 PM lucas.g...@gmail.com wrote:
> We use S3; there are caveats and issues with that, but it can be made to
> work.
>
> If interested, let me know and I'll show you our workarounds. I wouldn't
> do it naively though; there are lots of potential problems. If you already
> have HDFS, use that; otherwise, all things told, it's probably less effort
> to use S3.
>
> Gary
> [earlier quoted messages trimmed]

--
Sent from Gmail Mobile
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
We use S3; there are caveats and issues with that, but it can be made to work.

If interested, let me know and I'll show you our workarounds. I wouldn't do it naively though; there are lots of potential problems. If you already have HDFS, use that; otherwise, all things told, it's probably less effort to use S3.

Gary

On 29 September 2017 at 05:03, Arun Rai wrote:
> Or you can try mounting that drive on all nodes.
> [earlier quoted messages trimmed]
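For reference, a minimal sketch of the S3 route described above. This assumes the hadoop-aws package is on the classpath; the bucket name is a placeholder, not something from this thread, and the caveats mentioned still apply:

```scala
// Sketch: reading the CSV from S3 via the s3a connector.
// Requires the hadoop-aws JAR on the classpath, e.g.
//   spark-submit --packages org.apache.hadoop:hadoop-aws:<your-hadoop-version> ...
// "my-bucket" is a placeholder; credentials here come from environment
// variables, but instance roles or core-site.xml work as well.
import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleFlightsApp")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Every executor fetches the data from S3 directly, so nothing
    // needs to be copied to the worker nodes.
    val flightDF = spark.read
      .option("header", true)
      .csv("s3a://my-bucket/sampleflightdata") // placeholder bucket

    flightDF.printSchema()
    spark.stop()
  }
}
```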
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
Yes, you need to store the file at a location from which it is equally retrievable (the "same path") by the master and all nodes in the cluster. A simple solution (apart from HDFS) that does not scale too well, but might be OK with only 3 nodes as in your configuration, is network-accessible storage (a NAS or a shared folder, for example).

Hope this helps,
Alexander

On Fri, Sep 29, 2017 at 12:05 PM, Sathishkumar Manimoorthy <mrsathishkuma...@gmail.com> wrote:
> Place it in HDFS and give the reference path in your code.
>
> Thanks,
> Sathish
> [original message from Gaurav1809 trimmed]
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
Or you can try mounting that drive on all nodes.

On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke wrote:
> You should use a distributed filesystem such as HDFS. If you want to use
> the local filesystem then you have to copy each file to each node.
> [original message from Gaurav1809 trimmed]
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
You should use a distributed filesystem such as HDFS. If you want to use the local filesystem then you have to copy each file to each node.

> On 29. Sep 2017, at 12:05, Gaurav1809 wrote:
> [original message trimmed]
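If you do go the copy-to-each-node route, the path must exist at the same location on every worker, and an explicit `file://` scheme makes that intent clear. A sketch, using the paths from the original post:

```scala
// Sketch: reading from the local filesystem in cluster mode.
// This only works if /home/username/sampleflightdata exists on EVERY
// worker node (copied e.g. with scp beforehand), not just on the master.
import org.apache.spark.sql.SparkSession

object LocalCopyReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleFlightsApp")
      .master("spark://masterIP:7077")
      .getOrCreate()

    val flightDF = spark.read
      .option("header", true)
      .csv("file:///home/username/sampleflightdata") // same path on all nodes

    flightDF.printSchema()
    spark.stop()
  }
}
```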
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
Place it in HDFS and give the reference path in your code.

Thanks,
Sathish

On Fri, Sep 29, 2017 at 3:31 PM, Gaurav1809 wrote:
> Hi All,
>
> I have a multi-node Spark cluster (1 master, 2 workers). The job reads CSV
> file data and works fine when run in local mode (local[*]). However, when
> the same job is run in cluster mode (spark://HOST:PORT), it is not able to
> read the file. I want to know how to reference the files, or where to
> store them. Currently the CSV data file is on the master (from where the
> job is submitted).
>
> The following code works fine in local mode but not in cluster mode:
>
> val spark = SparkSession
>   .builder()
>   .appName("SampleFlightsApp")
>   .master("spark://masterIP:7077") // change to .master("local[*]") for local mode
>   .getOrCreate()
>
> val flightDF =
>   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> flightDF.printSchema()
>
> Error: FileNotFoundException: File file:/home/gaurav/sampleflightdata does
> not exist
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
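Concretely, "give the reference path" means swapping the local path in the original snippet for an HDFS URI. A sketch, where the namenode host/port and the target directory are placeholders for your cluster, not values from this thread:

```scala
// Sketch: the original snippet with the local path replaced by an HDFS URI.
// "namenode:8020" and /data/sampleflightdata are placeholders; upload the
// file first, e.g.:
//   hdfs dfs -mkdir -p /data
//   hdfs dfs -put /home/username/sampleflightdata /data/
import org.apache.spark.sql.SparkSession

object HdfsCsvReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleFlightsApp")
      .master("spark://masterIP:7077")
      .getOrCreate()

    // All workers can reach the namenode, so no per-node copies are needed.
    val flightDF = spark.read
      .option("header", true)
      .csv("hdfs://namenode:8020/data/sampleflightdata")

    flightDF.printSchema()
    spark.stop()
  }
}
```

If `fs.defaultFS` already points at the cluster's namenode, the shorter path `"/data/sampleflightdata"` would resolve to HDFS as well.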