[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312541#comment-15312541 ]

Marco Capuccini edited comment on SPARK-15729 at 6/2/16 4:05 PM:
-----------------------------------------------------------------

[~sowen] I see! Thanks for the clarification. I'll open a PR. I think many 
people assume that a distributed file system is optional when using Spark in a 
distributed environment. I did not mention that when I was running my 
applications with 1.4.0, the data was written to NFS, which is probably why it 
worked fine. 

I have another question. Let's assume I have copied the input data to every 
node, at exactly the same path, and I read it using sc.textFile. Under this 
assumption, let's say I perform some analysis on the dataset, reducing it to 
something small enough to be collected on the driver node. If I then collect 
the reduced dataset and save it only on the machine where the driver is 
running, using the plain Scala IO primitives, would this work? Or could the 
results be corrupted?
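To make the question concrete, here is the pattern sketched in plain Scala. The tuples and the file name are made up for illustration; in the real job the array would come from rdd.collect() on the driver, but since collect() returns an ordinary local Array, everything after it is single-JVM IO with no Spark involvement:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical stand-in for the result of rdd.collect() on the driver.
// collect() returns a plain local Array, so no executors touch the code below.
val collected: Array[(String, Int)] = Array(("alpha", 3), ("beta", 5))

// Ordinary Scala/Java IO runs only in the driver JVM, so this file is
// written only on the machine where the driver is running.
val outFile = new File("reduced-output.txt")
val out = new PrintWriter(outFile)
try collected.foreach { case (k, v) => out.println(s"$k\t$v") }
finally out.close()

// Read it back to confirm the write landed on this (driver) machine.
val readBack = Source.fromFile(outFile).getLines().toList
```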







> saveAsTextFile not working on regular filesystem
> ------------------------------------------------
>
>                 Key: SPARK-15729
>                 URL: https://issues.apache.org/jira/browse/SPARK-15729
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Marco Capuccini
>            Priority: Blocker
>
> I set up a standalone Spark cluster. I don't need HDFS, so I just want to 
> save the files on the regular file system in a distributed manner. For 
> testing purposes, I opened a Spark shell and ran the following code:
> sc.parallelize(1 to 100).saveAsTextFile("file:///mnt/volume/test.txt")
> I got no error from this, but if I inspect the /mnt/volume/test.txt 
> folder on each node, this is what I see:
> On the master (where I launched the spark shell):
> /mnt/volume/test.txt/_SUCCESS
> On the workers:
> /mnt/volume/test.txt/_temporary
> It seems like some failure occurred, but I didn't get any error. Is this a 
> bug, or am I missing something?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
