Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread vaquar khan
If you're running in cluster mode, you need to either copy the file to all
the nodes or place it on a shared file system that every node can access.
There are two options:

1) Put it into a distributed filesystem such as HDFS, or make it reachable via (s)ftp.

2) Transfer (sftp) the file to every worker node before running the Spark
job, and then pass the path of the file on the worker filesystem as the
argument to textFile (or to the CSV reader).

Regards,
Vaquar khan

On Fri, Sep 29, 2017 at 2:00 PM, JG Perrin  wrote:

> On a test system, you can also use something like
> Owncloud/Nextcloud/Dropbox to ensure that the files are synchronized. Would
> not do it for TB of data ;) ...
>
> -Original Message-
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Friday, September 29, 2017 5:14 AM
> To: Gaurav1809 
> Cc: user@spark.apache.org
> Subject: Re: [Spark-Submit] Where to store data files while running job in
> cluster mode?
>
> You should use a distributed filesystem such as HDFS. If you want to use
> the local filesystem then you have to copy each file to each node.
>
> > On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
> >
> > Hi All,
> >
> > I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
> > CSV file data and it works fine when run in local mode (local[*]).
> > However, when the same job is run in cluster mode (spark://HOST:PORT),
> > it is not able to read it.
> > I want to know how to reference the files, or where to store them?
> > Currently the CSV data file is on the master (from where the job is
> > submitted).
> >
> > The following code works fine in local mode but not in cluster mode.
> >
> > val spark = SparkSession
> >   .builder()
> >   .appName("SampleFlightsApp")
> >   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
> >   .getOrCreate()
> >
> > val flightDF =
> >   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> > flightDF.printSchema()
> >
> > Error: FileNotFoundException: File
> > file:/home/username/sampleflightdata does not exist
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


RE: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread JG Perrin
On a test system, you can also use something like Owncloud/Nextcloud/Dropbox to
ensure that the files are synchronized. Would not do it for TB of data ;) ...

-Original Message-
From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: Friday, September 29, 2017 5:14 AM
To: Gaurav1809 
Cc: user@spark.apache.org
Subject: Re: [Spark-Submit] Where to store data files while running job in 
cluster mode?

You should use a distributed filesystem such as HDFS. If you want to use the 
local filesystem then you have to copy each file to each node.

> On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
> 
> Hi All,
> 
> I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
> CSV file data and it works fine when run in local mode (local[*]).
> However, when the same job is run in cluster mode (spark://HOST:PORT),
> it is not able to read it.
> I want to know how to reference the files, or where to store them?
> Currently the CSV data file is on the master (from where the job is submitted).
> 
> The following code works fine in local mode but not in cluster mode.
> 
> val spark = SparkSession
>   .builder()
>   .appName("SampleFlightsApp")
>   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>   .getOrCreate()
> 
> val flightDF =
>   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> flightDF.printSchema()
> 
> Error: FileNotFoundException: File 
> file:/home/username/sampleflightdata does not exist
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 



Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Imran Rajjad
Try Tachyon (now Alluxio); it's less fuss.
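
For the Tachyon/Alluxio route, a hedged sketch (the master hostname, the
default port 19998, and the Alluxio path are assumptions; the Alluxio client
jar must be on the Spark classpath, e.g. passed via --jars):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadFlightsFromAlluxio")
  .getOrCreate()

// Depending on the Alluxio/Tachyon version, you may also need to register the
// Hadoop filesystem implementation, e.g.:
// spark.sparkContext.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

// Assumes the file has been loaded into Alluxio under /data/sampleflightdata.
val flightDF = spark.read
  .option("header", true)
  .csv("alluxio://alluxio-master:19998/data/sampleflightdata")

flightDF.printSchema()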


On Fri, 29 Sep 2017 at 8:32 PM lucas.g...@gmail.com 
wrote:

> We use S3; there are caveats and issues with that, but it can be made to
> work.
>
> If you're interested, let me know and I'll show you our workarounds. I
> wouldn't do it naively though; there are lots of potential problems. If you
> already have HDFS, use that; otherwise, all things considered, it's probably
> less effort to use S3.
>
> Gary
>
> On 29 September 2017 at 05:03, Arun Rai  wrote:
>
>> Or you can try mounting that drive on all nodes.
>>
>> On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke  wrote:
>>
>>> You should use a distributed filesystem such as HDFS. If you want to use
>>> the local filesystem then you have to copy each file to each node.
>>>
>>> > On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
>>> > CSV file data and it works fine when run in local mode (local[*]).
>>> > However, when the same job is run in cluster mode (spark://HOST:PORT),
>>> > it is not able to read it.
>>> > I want to know how to reference the files, or where to store them?
>>> > Currently the CSV data file is on the master (from where the job is submitted).
>>> >
>>> > The following code works fine in local mode but not in cluster mode.
>>> >
>>> > val spark = SparkSession
>>> >   .builder()
>>> >   .appName("SampleFlightsApp")
>>> >   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>>> >   .getOrCreate()
>>> >
>>> > val flightDF =
>>> >   spark.read.option("header", true).csv("/home/username/sampleflightdata")
>>> > flightDF.printSchema()
>>> >
>>> > Error: FileNotFoundException: File file:/home/username/sampleflightdata
>>> > does not exist
>>> >
>>> > --
>>> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>
>>
>
>
> --
Sent from Gmail Mobile


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread lucas.g...@gmail.com
We use S3; there are caveats and issues with that, but it can be made to
work.

If you're interested, let me know and I'll show you our workarounds. I
wouldn't do it naively though; there are lots of potential problems. If you
already have HDFS, use that; otherwise, all things considered, it's probably
less effort to use S3.
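
A hedged sketch of the S3 route via the s3a connector (the bucket name and key
are placeholders, and hadoop-aws plus its AWS SDK are assumed to be on the
Spark classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadFlightsFromS3")
  .getOrCreate()

// Credentials can also come from instance profiles or the default provider
// chain; setting them explicitly on the Hadoop configuration is just one option.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val flightDF = spark.read
  .option("header", true)
  .csv("s3a://my-bucket/sampleflightdata")

flightDF.printSchema()

This only covers reading; the caveats mentioned above (credentials handling,
S3 consistency and committer behaviour) still apply.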

Gary

On 29 September 2017 at 05:03, Arun Rai  wrote:

> Or you can try mounting that drive on all nodes.
>
> On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke  wrote:
>
>> You should use a distributed filesystem such as HDFS. If you want to use
>> the local filesystem then you have to copy each file to each node.
>>
>> > On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
>> >
>> > Hi All,
>> >
>> > I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
>> > CSV file data and it works fine when run in local mode (local[*]).
>> > However, when the same job is run in cluster mode (spark://HOST:PORT),
>> > it is not able to read it.
>> > I want to know how to reference the files, or where to store them?
>> > Currently the CSV data file is on the master (from where the job is
>> > submitted).
>> >
>> > The following code works fine in local mode but not in cluster mode.
>> >
>> > val spark = SparkSession
>> >   .builder()
>> >   .appName("SampleFlightsApp")
>> >   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>> >   .getOrCreate()
>> >
>> > val flightDF =
>> >   spark.read.option("header", true).csv("/home/username/sampleflightdata")
>> > flightDF.printSchema()
>> >
>> > Error: FileNotFoundException: File file:/home/username/sampleflightdata
>> > does not exist
>> >
>> >
>> >
>> > --
>> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>> >
>>


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Alexander Czech
Yes, you need to store the file at a location where it is equally
retrievable ("same path") from the master and all nodes in the cluster. A
simple solution (apart from HDFS) that does not scale too well, but might be
OK with only 3 nodes like in your configuration, is network-accessible
storage (a NAS or a shared folder, for example).
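
A sketch of that approach, assuming the share is mounted at the same
(placeholder) path /mnt/shared on the master and both workers:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadFlightsFromSharedMount")
  .getOrCreate()

// The explicit file:// scheme forces the local-filesystem connector; each
// executor resolves the path on its own node, so the mount must exist there.
val flightDF = spark.read
  .option("header", true)
  .csv("file:///mnt/shared/sampleflightdata")

flightDF.printSchema()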

hope this helps
Alexander

On Fri, Sep 29, 2017 at 12:05 PM, Sathishkumar Manimoorthy <
mrsathishkuma...@gmail.com> wrote:

> Place it in HDFS and give the reference path in your code.
>
> Thanks,
> Sathish
>
> On Fri, Sep 29, 2017 at 3:31 PM, Gaurav1809 
> wrote:
>
>> Hi All,
>>
>> I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
>> CSV file data and it works fine when run in local mode (local[*]).
>> However, when the same job is run in cluster mode (spark://HOST:PORT),
>> it is not able to read it.
>> I want to know how to reference the files, or where to store them?
>> Currently the CSV data file is on the master (from where the job is
>> submitted).
>>
>> The following code works fine in local mode but not in cluster mode.
>>
>> val spark = SparkSession
>>   .builder()
>>   .appName("SampleFlightsApp")
>>   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>>   .getOrCreate()
>>
>> val flightDF =
>>   spark.read.option("header", true).csv("/home/username/sampleflightdata")
>> flightDF.printSchema()
>>
>> Error: FileNotFoundException: File file:/home/gaurav/sampleflightdata
>> does
>> not exist
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>>
>


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Arun Rai
Or you can try mounting that drive on all nodes.

On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke  wrote:

> You should use a distributed filesystem such as HDFS. If you want to use
> the local filesystem then you have to copy each file to each node.
>
> > On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
> >
> > Hi All,
> >
> > I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
> > CSV file data and it works fine when run in local mode (local[*]).
> > However, when the same job is run in cluster mode (spark://HOST:PORT),
> > it is not able to read it.
> > I want to know how to reference the files, or where to store them?
> > Currently the CSV data file is on the master (from where the job is
> > submitted).
> >
> > The following code works fine in local mode but not in cluster mode.
> >
> > val spark = SparkSession
> >   .builder()
> >   .appName("SampleFlightsApp")
> >   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
> >   .getOrCreate()
> >
> > val flightDF =
> >   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> > flightDF.printSchema()
> >
> > Error: FileNotFoundException: File file:/home/username/sampleflightdata
> > does not exist
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
>


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Jörn Franke
You should use a distributed filesystem such as HDFS. If you want to use the 
local filesystem then you have to copy each file to each node.

> On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
> 
> Hi All,
> 
> I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
> CSV file data and it works fine when run in local mode (local[*]).
> However, when the same job is run in cluster mode (spark://HOST:PORT),
> it is not able to read it.
> I want to know how to reference the files, or where to store them?
> Currently the CSV data file is on the master (from where the job is submitted).
> 
> The following code works fine in local mode but not in cluster mode.
> 
> val spark = SparkSession
>   .builder()
>   .appName("SampleFlightsApp")
>   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>   .getOrCreate()
> 
> val flightDF =
>   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> flightDF.printSchema()
> 
> Error: FileNotFoundException: File file:/home/username/sampleflightdata does
> not exist
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 



Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Sathishkumar Manimoorthy
Place it in HDFS and give the reference path in your code.

Thanks,
Sathish

On Fri, Sep 29, 2017 at 3:31 PM, Gaurav1809  wrote:

> Hi All,
>
> I have a multi-node (1 master, 2 workers) Spark cluster; the job reads
> CSV file data and it works fine when run in local mode (local[*]).
> However, when the same job is run in cluster mode (spark://HOST:PORT),
> it is not able to read it.
> I want to know how to reference the files, or where to store them?
> Currently the CSV data file is on the master (from where the job is
> submitted).
>
> The following code works fine in local mode but not in cluster mode.
>
> val spark = SparkSession
>   .builder()
>   .appName("SampleFlightsApp")
>   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>   .getOrCreate()
>
> val flightDF =
>   spark.read.option("header", true).csv("/home/username/sampleflightdata")
> flightDF.printSchema()
>
> Error: FileNotFoundException: File file:/home/gaurav/sampleflightdata does
> not exist
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>