[jira] [Commented] (SPARK-15729) saveAsTextFile not working on regular filesystem

2016-06-02 Thread Marco Capuccini (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312541#comment-15312541 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

I see! Thanks for the clarification. I'll open a PR; I think many people assume 
that a distributed filesystem is optional when using Spark in a distributed 
environment. I did not mention that when I was running my applications on 
1.4.0 the data was written to NFS, and maybe that's why it was working fine. 

I have another question. Let's assume I have copied the input data to each 
node, at exactly the same path, and I read it using sc.textFile. Under that 
assumption, let's say I perform some analysis on the dataset, reducing it to 
something that can be collected on the driver node. If I then collect the 
reduced dataset and save it only on the machine where the driver is running, 
using the Scala I/O primitives, would this work? Or could the results be 
corrupted?
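To make the question concrete, a driver-side save could look like the sketch below. This is plain Scala I/O with no Spark involved in the write itself; the `reduced` value stands in for whatever a reduce/collect would return on the driver, and the output path is purely illustrative:

```scala
import java.io.{File, PrintWriter}

// Stand-in for the small result gathered on the driver, e.g. the value of
// rdd.map(...).reduce(...) or rdd.collect(); here simulated locally.
val reduced: Seq[Int] = (1 to 100).map(_ * 2)

// Illustrative output path; this file exists only on the driver machine.
val out = new File("/tmp/reduced-result.txt")
val writer = new PrintWriter(out)
try reduced.foreach(writer.println)
finally writer.close()
```

Since the write happens in a single JVM on one machine, no distributed commit protocol is involved, which is what makes this approach safe for small, collected results.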



> saveAsTextFile not working on regular filesystem
> ------------------------------------------------
>
> Key: SPARK-15729
> URL: https://issues.apache.org/jira/browse/SPARK-15729
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Marco Capuccini
>Priority: Blocker
>
> I set up a standalone Spark cluster. I don't need HDFS, so I just want to 
> save the files on the regular file system in a distributed manner. For 
> testing purposes, I opened a Spark shell and ran the following code:
> sc.parallelize(1 to 100).saveAsTextFile("file:///mnt/volume/test.txt")
> I got no error from this, but if I go to inspect the /mnt/volume/test.txt 
> folder on each node this is what I see:
> On the master (where I launched the spark shell):
> /mnt/volume/test.txt/_SUCCESS
> On the workers:
> /mnt/volume/test.txt/_temporary
> It seems like some failure occurred, but I didn't get any error. Is this a 
> bug, or am I missing something?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15729) saveAsTextFile not working on regular filesystem

2016-06-02 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312511#comment-15312511 ]

Sean Owen commented on SPARK-15729:
-----------------------------------

It would only make sense if you were running everything on one machine, yes. In 
that case it should work as expected. Otherwise, you have a bunch of processes 
writing to entirely different filesystems that merely happen to share the same 
paths, and the outcome is undefined.

If you want to open a PR to change that mention to clarify this only makes 
sense for local operation, that's fine, and we can reopen for that purpose. 
Same for saveAsSequenceFile.
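For illustration, a minimal sketch of the local-only case where a file:// URI does round-trip (paths are illustrative, and this assumes a shell started with a local master, e.g. `spark-shell --master local[*]`):

```scala
// With --master local[*], every task runs on this one machine, so a
// file:// URI refers to a single shared local filesystem: the committer
// can see all task output and the saved directory is complete.
sc.parallelize(1 to 100).saveAsTextFile("file:///tmp/spark-local-out")
val n = sc.textFile("file:///tmp/spark-local-out").count()  // 100 in local mode
```

In a multi-node standalone cluster the same code scatters task output across each worker's private /tmp, which is the failure mode described in this issue.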




[jira] [Commented] (SPARK-15729) saveAsTextFile not working on regular filesystem

2016-06-02 Thread Marco Capuccini (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312506#comment-15312506 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

[~srowen] I am sorry for reopening the issue. So saving to the regular file 
system is only allowed when using the "local" master? I don't understand. The 
documentation clearly states that saveAsTextFile will "Write the elements of the 
dataset as a text file (or set of text files) in a given directory in the local 
filesystem ...". It does not mention that the regular file system is unsupported 
when running distributed. Also, why did this same code work in 1.4.0?




[jira] [Commented] (SPARK-15729) saveAsTextFile not working on regular filesystem

2016-06-02 Thread Marco Capuccini (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312482#comment-15312482 ]

Marco Capuccini commented on SPARK-15729:
-----------------------------------------

[~sowen] I know what I am doing. I know that each node will hold a piece of the 
resulting dataset. I am sure this is a bug: in version 1.4.0 it works, and the 
part files are correctly placed under /mnt/volume/test.txt on each node. 

This is what I see under /mnt/volume/test.txt in version 1.4.0:
master:
_SUCCESS

worker 1:
PART_0001 ... PART_000M

worker 2:
PART_000M ... 

In Spark 1.6.0, I see something like:
master:
_SUCCESS

worker 1:
_temporary

worker 2:
_temporary

You can run this simple test when saving to the regular file system:

sc.parallelize(1 to 100).saveAsTextFile("file:///mnt/volume/test.txt")
val count = sc.textFile("file:///mnt/volume/test.txt").count
println(count)

The result will be 0 in version 1.6.0. If this is not a bug, I'd like to 
understand what is going on in version 1.6.x, as in 1.4.0 this works like a 
charm.
