[jira] [Commented] (SPARK-15729) saveAsTextFile not working on regular filesystem
[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312541#comment-15312541 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

I see! Thanks for the clarification. I'll open a PR; I think many people assume that a distributed file system is optional when using Spark in a distributed environment. I did not mention that when I was running my applications on 1.4.0 the data was written to NFS, and maybe that's why it worked fine.

I have another question. Assume I have copied the input data to every node, at exactly the same path, and I read it with sc.textFile. Say I then perform some analysis on the dataset, reducing it to something small enough to collect on the driver node. If I collect the reduced dataset and save it only on the machine where the driver is running, using the Scala IO primitives, would this work? Or could the results be corrupted?

> saveAsTextFile not working on regular filesystem
> ------------------------------------------------
>
>                 Key: SPARK-15729
>                 URL: https://issues.apache.org/jira/browse/SPARK-15729
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Marco Capuccini
>            Priority: Blocker
>
> I set up a standalone Spark cluster. I don't need HDFS, so I just want to
> save the files on the regular file system in a distributed manner. For
> testing purposes, I opened a Spark shell and ran the following code:
>
>     sc.parallelize(1 to 100).saveAsTextFile("file:///mnt/volume/test.txt")
>
> I got no error from this, but if I inspect the /mnt/volume/test.txt
> folder on each node, this is what I see:
>
> On the master (where I launched the Spark shell):
>     /mnt/volume/test.txt/_SUCCESS
> On the workers:
>     /mnt/volume/test.txt/_temporary
>
> It seems like some failure occurred, but I didn't get any error. Is this a
> bug, or am I missing something?
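The driver-side save described above can be sketched without a cluster at all: after collect(), the result is an ordinary Scala collection inside the driver JVM, so plain Scala/Java IO writes it to a single machine's file system with no executor involvement. A minimal sketch, assuming the reduced dataset fits in driver memory; a local array stands in for the collected result, and the temp file path is illustrative:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Stand-in for `rdd.collect()`: once collected, the data is just a local
// Scala collection on the driver, e.g. sc.parallelize(1 to 100).collect()
val collected: Array[Int] = (1 to 100).toArray

// Plain driver-side IO: the file exists only on the driver's machine.
val f = File.createTempFile("reduced", ".txt")
val out = new PrintWriter(f)
try collected.foreach(n => out.println(n)) finally out.close()

// Reading it back on the same machine sees every record, because the
// write happened in a single JVM against a single file system.
val lines = Source.fromFile(f).getLines().toVector
assert(lines.size == 100)
```

Nothing in this path involves the output-committer machinery that saveAsTextFile uses, which is why no shared file system is needed for it.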
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312511#comment-15312511 ]

Sean Owen commented on SPARK-15729:
-----------------------------------

It would only make sense if you were running everything on one machine, yes. In that case it should work as expected. Otherwise, you have a bunch of processes writing to entirely different "same" file systems, and who knows what happens. If you want to open a PR to clarify that this only makes sense for local operation, that's fine, and we can reopen the issue for that purpose. The same applies to saveAsSequenceFile.
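The "entirely different 'same' filesystems" point can be made concrete without Spark. In the sketch below, separate temporary directories stand in for the private local disks of two workers and the driver, all nominally mounted at the same path; each worker's part file lands on its own disk, so the driver's copy of the directory stays empty. Directory and part-file names are illustrative:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files

// Three temp dirs simulate the *separate* local file systems of two
// workers and the driver, each of which sees its own "/mnt/volume".
val worker1 = Files.createTempDirectory("worker1").toFile
val worker2 = Files.createTempDirectory("worker2").toFile
val driver  = Files.createTempDirectory("driver").toFile

def writePart(dir: File, name: String, rows: Range): Unit = {
  val out = new PrintWriter(new File(dir, name))
  try rows.foreach(n => out.println(n)) finally out.close()
}

// Each worker writes its partition to its own disk.
writePart(worker1, "part-00000", 1 to 50)
writePart(worker2, "part-00001", 51 to 100)

// The driver lists *its* local directory: none of the parts are visible,
// which is why a follow-up read from the driver's node can come back empty.
assert(driver.listFiles.isEmpty)
```

With a shared file system (NFS, HDFS) all three paths would resolve to the same directory, and the driver would see both part files.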
[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312506#comment-15312506 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

[~srowen] I am sorry for reopening the issue. So saving to the regular file system is allowed only when using the "local" master? I don't understand. The documentation clearly states that saveAsTextFile will "Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem ...". It does not mention that the regular file system is unsupported when running on a cluster. Also, why did this same code work in 1.4.0?
[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312482#comment-15312482 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

[~srowen] I know what I am doing. I know that each node will hold a piece of the resulting dataset. I am sure that this is a bug: in version 1.4.0 this works, and the parts are correctly located under /mnt/volume/test.txt on each node. This is what I see under /mnt/volume/test.txt with 1.4.0:

    master:   _SUCCESS
    worker 1: PART_0001 ... PART_000M
    worker 2: PART_000M ...

In Spark 1.6.0, I see something like:

    master:   _SUCCESS
    worker 1: _temporary
    worker 2: _temporary

You can run this simple test when you save to the regular file system:

    sc.parallelize(1 to 100).saveAsTextFile("file:///mnt/volume/test.txt")
    val count = sc.textFile("file:///mnt/volume/test.txt").count
    println(count)

The result will be 0 in version 1.6.0. If this is not a bug, I'd like to understand what is going on in 1.6.x, as in 1.4.0 this works like a charm.
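The layout reported above (only _SUCCESS on the master, only _temporary on the workers) is consistent with how the Hadoop output committer behaves when the output path is not shared: each task writes under _temporary on the node that ran it, while the job-commit step that promotes task output to final part files and writes _SUCCESS runs on the driver, which cannot see the workers' disks. A rough simulation with plain file IO; the directory names follow the committer's _temporary layout, and the temp dirs stand in for each node's private /mnt/volume:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files

// Separate temp dirs simulate a worker's and the driver's local disks.
val workerFs = Files.createTempDirectory("workerFs").toFile
val driverFs = Files.createTempDirectory("driverFs").toFile

// A task on the worker writes its part file under _temporary on *its* disk.
val taskDir = new File(workerFs, "test.txt/_temporary/0/task_0000")
taskDir.mkdirs()
val part = new PrintWriter(new File(taskDir, "part-00000"))
try (1 to 100).foreach(n => part.println(n)) finally part.close()

// Job commit on the driver: promote whatever sits under _temporary on
// *its* disk, then mark success. It finds no task output at all.
val driverOut = new File(driverFs, "test.txt")
driverOut.mkdirs()
val pending = new File(driverOut, "_temporary")
// Nothing to promote: _temporary does not exist on the driver's disk.
val success = new PrintWriter(new File(driverOut, "_SUCCESS"))
success.close()

assert(!pending.exists)                        // no parts were promoted
assert(new File(driverOut, "_SUCCESS").exists) // yet the job "succeeded"
```

This would also explain the count of 0: the driver-side sc.textFile sees a directory containing only _SUCCESS, while the actual data is stranded under _temporary on the workers.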