Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander


Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Jacinto Arias
Yes, printing the result with collect or take works.

This is actually a minimal example, but the same happens when working with real data:
the actions are performed, and the resulting RDDs can be printed out without
problems. The data is there and the operations are correct; they just cannot be
written to a file.
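
For reference, the kind of check we run looks roughly like this (a sketch in the
spark-shell; the 100-element sequence and the comments are only illustrative, the
real data is much larger):

scala> import scala.util.Random.nextInt
scala> val dist = sc.parallelize(Seq.fill(100)(nextInt))
scala> dist.count()    // returns 100, as expected
scala> dist.take(5)    // prints five elements
scala> dist.collect()  // also returns the full data
scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")  // only an empty folder appears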


> On 03 Oct 2015, at 16:17, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> bq.  val dist = sc.parallelize(l)
> 
> Following the above, can you call, e.g. count() on dist before saving?
> 
> Cheers
> 
> On Fri, Oct 2, 2015 at 1:21 AM, jarias <ja...@elrocin.es> wrote:
> Dear list,
> 
> I'm experiencing a problem when trying to write any RDD to HDFS. I've tried
> with minimal examples, scala programs and pyspark programs both in local and
> cluster modes and as standalone applications or shells.
> 
> My problem is that when invoking the write command, a task is executed but
> it just creates an empty folder in the given HDFS path. I'm lost at this
> point because there is no sign of error or warning in the spark logs.
> 
> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
> working properly when using the command tools or running MapReduce jobs.
> 
> 
> Thank you for your time, I'm not sure if this is just a rookie mistake or an
> overall config problem.
> 
> Just a working example:
> 
> This sequence produces the following log and creates the empty folder
> "test":
> 
> scala> val l = Seq.fill(1)(nextInt)
> scala> val dist = sc.parallelize(l)
> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
> 
> 
> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
> <console>:27
> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
> <console>:27) with 2 output partitions (allowLocal=false)
> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
> <console>:27)
> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[7]
> at saveAsTextFile at <console>:27), which has no missing parents
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
> curMem=184615, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 134.1 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
> curMem=321951, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
> in memory (estimated size 46.6 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
> broadcast_3_piece0
> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:839
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3
> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 312 ms on nodo2.i3a.info (1/2)
> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 313 ms on nodo3.i3a.info (2/2)
> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
> all completed, from pool
> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
> <console>:27) finished in 0.334 s
> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
> <console>:27, took 0.436388 s
> 
> 
> 
> 

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq.  val dist = sc.parallelize(l)

Following the above, can you call, e.g. count() on dist before saving?
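
Something along these lines (just a sketch; dist is the RDD from your snippet, and
the expected count depends on how the Seq was filled):

scala> val dist = sc.parallelize(l)
scala> dist.count()   // does this match l.size?
scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")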

Cheers

On Fri, Oct 2, 2015 at 1:21 AM, jarias <ja...@elrocin.es> wrote:

> Dear list,
>
> I'm experiencing a problem when trying to write any RDD to HDFS. I've
> tried
> with minimal examples, scala programs and pyspark programs both in local
> and
> cluster modes and as standalone applications or shells.
>
> My problem is that when invoking the write command, a task is executed but
> it just creates an empty folder in the given HDFS path. I'm lost at this
> point because there is no sign of error or warning in the spark logs.
>
> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
> working properly when using the command tools or running MapReduce jobs.
>
>
> Thank you for your time, I'm not sure if this is just a rookie mistake or
> an
> overall config problem.
>
> Just a working example:
>
> This sequence produces the following log and creates the empty folder
> "test":
>
> scala> val l = Seq.fill(1)(nextInt)
> scala> val dist = sc.parallelize(l)
> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>
>
> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
> <console>:27
> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
> <console>:27) with 2 output partitions (allowLocal=false)
> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
> <console>:27)
> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3
> (MapPartitionsRDD[7]
> at saveAsTextFile at <console>:27), which has no missing parents
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
> curMem=184615, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 134.1 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
> curMem=321951, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as
> bytes
> in memory (estimated size 46.6 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
> broadcast_3_piece0
> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:839
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 3
> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 312 ms on nodo2.i3a.info (1/2)
> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 313 ms on nodo3.i3a.info (2/2)
> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
> all completed, from pool
> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
> <console>:27) finished in 0.334 s
> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
> <console>:27, took 0.436388 s
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


saveAsTextFile creates an empty folder in HDFS

2015-10-02 Thread jarias
Dear list,

I'm experiencing a problem when trying to write any RDD to HDFS. I've tried
minimal examples, Scala programs, and PySpark programs, both in local and
cluster modes, and as standalone applications and shells.

My problem is that when I invoke the write command, a task is executed, but
it just creates an empty folder at the given HDFS path. I'm lost at this
point because there is no sign of an error or a warning in the Spark logs.

I'm running a seven-node cluster managed by CDH 5.7 with Spark 1.3. HDFS is
working properly when using the command-line tools or running MapReduce jobs.
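
As an extra sanity check, writing a small file directly through the Hadoop
FileSystem API from the same spark-shell could also be tried (just a sketch, not
something I have run in the session below; the file name is made up):

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> val out = fs.create(new Path("hdfs://node1.i3a.info/user/jarias/fs-check.txt"))
scala> out.writeBytes("hello\n"); out.close()  // if this file appears, plain HDFS writes work from the shell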


Thank you for your time; I'm not sure if this is just a rookie mistake or an
overall config problem.

Here is a minimal example:

This sequence produces the following log and creates the empty folder
"test":

scala> import scala.util.Random.nextInt
scala> val l = Seq.fill(1)(nextInt)
scala> val dist = sc.parallelize(l)
scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")


15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
version is 1
15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
<console>:27
15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
<console>:27) with 2 output partitions (allowLocal=false)
15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
<console>:27)
15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[7]
at saveAsTextFile at <console>:27), which has no missing parents
15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
curMem=184615, maxMem=278302556
15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 134.1 KB, free 265.1 MB)
15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
curMem=321951, maxMem=278302556
15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 46.6 KB, free 265.1 MB)
15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
broadcast_3_piece0
15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
DAGScheduler.scala:839
15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3
(MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
6) in 312 ms on nodo2.i3a.info (1/2)
15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
7) in 313 ms on nodo3.i3a.info (2/2)
15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
all completed, from pool 
15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
<console>:27) finished in 0.334 s
15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
<console>:27, took 0.436388 s
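
Listing the output path afterwards shows what "empty" means here (a sketch using
the Hadoop FileSystem API; a successful save with two partitions would normally
leave part-0000* files and a _SUCCESS marker, whereas this folder has nothing in it):

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> fs.listStatus(new Path("hdfs://node1.i3a.info/user/jarias/test/")).foreach(s => println(s.getPath.getName))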




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org