RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Wang, Ningjun (LNG-NPV)
Can anybody answer this? Do I have to have HDFS to achieve this?

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541




Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Davies Liu
If the dataset is not huge (a few GB), you can set up NFS instead of
HDFS (which is much harder to set up):

1. Export a directory on the master (or any node in the cluster).
2. Mount it at the same path on all slaves.
3. Read/write from it via file:///path/to/mountpoint.
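
For example, a minimal, untested sketch of the read/write side (the mount point
/mnt/shared is a made-up path; adjust it to wherever the export is mounted, and
sc is the usual SparkContext from the shell):

// Sketch only: assumes the NFS export is mounted at /mnt/shared on every node.
val path = "file:///mnt/shared/rdd1"

val rdd = sc.parallelize(Seq("a", "b", "c"), 3)

// Each task writes its partition as a part-NNNNN file into the shared directory,
// which looks identical on every node.
rdd.saveAsObjectFile(path)

// Any executor (and the driver) can list and read every part file, so the
// whole RDD can be reconstructed.
val rdd2 = sc.objectFile[String](path)
println(rdd2.count())   // 3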





RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Mohammed Guller
I don’t think it will work without HDFS.

Mohammed




Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I asked this question before but got no answer. Asking again.

Can I save an RDD to the local file system and then read it back on a Spark
cluster with multiple nodes?

rdd.saveAsObjectFile("file:///home/data/rdd1")

val rdd2 = sc.objectFile("file:///home/data/rdd1")

This works if the cluster has only one node. But my cluster has 3 nodes, and
each node has a local dir called /home/data. Is the RDD saved to the local dir
on all 3 nodes? If so, is sc.objectFile(...) smart enough to read the local dir
on all 3 nodes and merge them into a single RDD?

Ningjun



Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Imran Rashid
I'm not positive, but I think this is very unlikely to work.

First, when you call sc.objectFile(...), I think the *driver* will need to
know something about the file, e.g. to know how many tasks to create. But it
won't even be able to see the file, since it only lives on the local
filesystem of the cluster nodes.

If you really wanted to, you could probably write out some small metadata
about the files and write your own version of objectFile that uses it. But I
think there is a bigger conceptual issue: in general you can't be sure that
you will be running on the same nodes when you read the file back in as when
you saved it, so the file might not be present on the local filesystem for
the active executors. You might be able to guarantee it for the specific
cluster setup you have now, but it might limit you down the road.
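
To make the metadata idea concrete, here is a rough, untested sketch of such a
home-grown objectFile. The helper names are made up, it uses plain Java
serialization rather than the SequenceFile format that saveAsObjectFile
actually writes, and it carries exactly the caveat above: the task that reads
partition i must run on the node that originally wrote partition i.

import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Save each partition to a local file and return the partition count,
// which is the "small metadata" the driver needs to keep somewhere.
// Elements must be java.io.Serializable.
def saveLocally[T](rdd: RDD[T], dir: String): Int = {
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val out = new ObjectOutputStream(new FileOutputStream(s"$dir/part-$i"))
    out.writeObject(iter.toList)   // written on whichever node runs this task
    out.close()
    Iterator.single(i)
  }.count()                        // force the writes to happen
  rdd.partitions.length
}

// Rebuild the RDD with one task per saved partition; each task reads its
// own local file (and fails if that file isn't on the node it runs on).
def loadLocally[T: scala.reflect.ClassTag](sc: SparkContext, dir: String, numParts: Int): RDD[T] =
  sc.parallelize(0 until numParts, numParts).flatMap { i =>
    val in = new ObjectInputStream(new FileInputStream(s"$dir/part-$i"))
    val data = in.readObject().asInstanceOf[List[T]]
    in.close()
    data
  }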

What are you trying to achieve? There might be a better way. I believe
writing to HDFS will usually place one copy of the data locally, so you'd
still be doing a local read when you reload it.

Imran



RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I need to save an RDD to the file system and then restore it from the file
system later. I don't have an HDFS file system and don't want to go through
the hassle of setting one up. So how can I achieve this? The application
needs to run on a cluster with multiple nodes.

Regards,

Ningjun




Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-14 Thread Wang, Ningjun (LNG-NPV)
Can I save an RDD to the local file system and then read it back on a Spark
cluster with multiple nodes?

rdd.saveAsObjectFile("file:///home/data/rdd1")

val rdd2 = sc.objectFile("file:///home/data/rdd1")

This works if the cluster has only one node. But my cluster has 3 nodes, and
each node has a local dir called /home/data. Is the RDD saved to the local dir
on all 3 nodes? If so, is sc.objectFile(...) smart enough to read the local
dir on all nodes and merge them into a single RDD?

Ningjun