Re: saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-28 Thread Ascot Moss
Hi,

Thanks for your reply.

Permissions (access) are not an issue in my case, because this issue
only happened when the bigger input file was used to generate the model,
i.e. with smaller input(s) all worked well. It seems to me that ".save"
cannot save a big file.

Q1: Any idea about the size limit that ".save" can handle?
Q2: Any idea how to check the size of the model that will be saved via
".save"?

Regards
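A hedged, driver-side sketch of one way to approach Q2: serialize the object in memory first and measure the bytes. This assumes the model is `java.io.Serializable`; `payload` below is a stand-in for the real model, and the numbers are illustrative only (a huge model may OOM during this probe too, so run it with care):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize an object into an in-memory buffer and report its size.
def serializedSizeBytes(obj: AnyRef): Long = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)  // may itself use a lot of memory for a huge model
  out.close()
  buf.size().toLong
}

// Stand-in for the real model object (hypothetical):
val payload: Array[Double] = Array.fill(1000)(1.0)
val n = serializedSizeBytes(payload)
println(s"approx serialized size: $n bytes")
```

For MLlib tree ensembles, fields such as the total node count (if exposed by the model class in your Spark version) can also hint at how large the saved output will be.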



On Thu, Jul 28, 2016 at 4:19 PM, Spico Florin  wrote:

> Hi!
>   There are many reasons that your task is failed. One could be that you
> don't have proper permissions (access) to  hdfs with your user. Please
> check your user rights to write in hdfs. Please have a look also :
>
> http://stackoverflow.com/questions/27427042/spark-unable-to-save-in-hadoop-permission-denied-for-user
> I hope it helps.
>  Florin
>
>
> On Thu, Jul 28, 2016 at 3:49 AM, Ascot Moss  wrote:
>
>>
>> Hi,
>>
>> Please help!
>>
>> When saving the model, I got following error and cannot save the model to
>> hdfs:
>>
>> (my source code; my Spark version is 1.6.2)
>> my_model.save(sc, "/my_model")
>>
>> -
>> 16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose
>> tasks have all completed, from pool
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
>> treeEnsembleModels.scala:447) finished in 0.901 s
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
>> treeEnsembleModels.scala:447, took 2.513396 s
>>
>> Killed
>> -
>>
>>
>> Q1: Is there any limitation on saveAsTextFile?
>> Q2: Or where can I find the error log file?
>>
>> Regards
>>
>>
>>
>>
>>
>


RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
If the data is not too big, one option is to call the collect method and then 
save the result to a local file using the standard Java/Scala APIs. However, keep in 
mind that this will transfer data from all the worker nodes to the driver 
program. It looks like that is what you want to do anyway, but you need to be 
aware of how big that data is and of the related implications.
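The collect-then-write approach above can be sketched in plain Scala; here the array stands in for what `rdd.collect()` would return on the driver, and the output path is hypothetical:

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

// Stand-in for the result of rdd.collect() on the driver:
val collected: Array[String] = Array("line1", "line2", "line3")

// Write all lines to a single local file on the driver node (hypothetical path).
val target = Paths.get("/tmp/rdd-output.txt")
Files.write(target, collected.toSeq.asJava, StandardCharsets.UTF_8)

// Read it back to confirm the write succeeded.
val readBack = Files.readAllLines(target, StandardCharsets.UTF_8).asScala
println(readBack.mkString(","))
```

Unlike `saveAsTextFile`, this produces one ordinary file on the driver machine, which only works when the collected data fits in driver memory.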

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Monday, February 1, 2016 6:00 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohamed,

Thanks for your response. The data is available on the worker nodes, but I was 
looking for something that writes directly to the local fs. Seems like that is 
not an option.

Thanks,
Sivakumar Bhavanari.


RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
You should not be saving an RDD to local FS if Spark is running on a real 
cluster. Essentially, each Spark worker will save the partitions that it 
processes locally.

Check the directories on the worker nodes and you should find pieces of your 
file on each node.
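The per-partition behavior described above can be illustrated with a driver-only sketch (no Spark involved): each partition becomes its own `part-NNNNN` file, written by whichever worker processed it, so on a real cluster the pieces land on different machines. Here the "partitions" are plain local sequences and the output directory is a temp dir:

```scala
import java.nio.file.{Files, Path}
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

// Pretend these are the RDD's partitions; on a cluster each would be
// written by a different executor to that executor's local filesystem.
val partitions: Seq[Seq[String]] = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))

val outDir: Path = Files.createTempDirectory("saveAsTextFile-demo")
partitions.zipWithIndex.foreach { case (part, i) =>
  val name = f"part-$i%05d"  // mirrors Spark's part-file naming scheme
  Files.write(outDir.resolve(name), part.asJava, StandardCharsets.UTF_8)
}

val written = Files.list(outDir).iterator().asScala
  .map(_.getFileName.toString).toList.sorted
println(written)  // part-00000, part-00001, part-00002
```

With `saveAsTextFile` against a local path, each executor performs its own write, which is why no single machine ends up with the whole output.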

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 5:40 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your quick response. I'm submitting the Spark job to YARN in 
"yarn-client" mode on a 6-node cluster. I ran the job with DEBUG mode turned 
on. I see the exception below, but it occurred after the saveAsTextFile 
call finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is what is causing it?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Is it a multi-node cluster or you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com<mailto:sbhavan...@gmail.com>]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Hi Everyone,

We are using Spark 1.4.1 and we have a requirement to write data to the local fs 
instead of HDFS.

When trying to save an RDD to the local fs with saveAsTextFile, it just writes a 
_SUCCESS file in the folder with no part- files, and there are no error or 
warning messages on the console.

Is there anywhere I should look to fix this problem?

Thanks,
Sivakumar Bhavanari.





Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq.  val dist = sc.parallelize(l)

Following the above, can you call, e.g. count() on dist before saving ?

Cheers
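The reason calling `count()` first is useful: RDDs are lazy, so nothing runs until an action, and a failing save can mask whether the computation or the write is at fault. A plain-Scala sketch of the same laziness using a view (no Spark needed):

```scala
// RDD transformations are lazy, like a Scala view: nothing below executes
// until something forces it. Calling count() before saving (like toList
// here) separates "the computation fails" from "the write fails".
var evaluations = 0
val lazyData = (1 to 5).view.map { x =>
  evaluations += 1  // side effect so we can observe when work happens
  x * 2
}
println(s"after defining: $evaluations evaluations")  // still 0: nothing ran

val forced = lazyData.toList  // analogous to count()/collect(): runs the work
println(s"after forcing: $evaluations evaluations")   // now 5
```

If `dist.count()` succeeds but `saveAsTextFile` still leaves an empty folder, the problem is on the write path (HDFS/permissions/committer) rather than in the RDD computation.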

On Fri, Oct 2, 2015 at 1:21 AM, jarias  wrote:

> Dear list,
>
> I'm experiencing a problem when trying to write any RDD to HDFS. I've
> tried minimal examples, Scala programs, and PySpark programs, both in
> local and cluster modes, and as standalone applications or shells.
>
> My problem is that when invoking the write command, a task is executed but
> it just creates an empty folder in the given HDFS path. I'm lost at this
> point because there is no sign of error or warning in the spark logs.
>
> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
> working properly when using the command tools or running MapReduce jobs.
>
>
> Thank you for your time, I'm not sure if this is just a rookie mistake or
> an
> overall config problem.
>
> Just a working example:
>
> This sequence produces the following log and creates the empty folder
> "test":
>
> scala> val l = Seq.fill(1)(nextInt)
> scala> val dist = sc.parallelize(l)
> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>
>
> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
> <console>:27
> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
> <console>:27) with 2 output partitions (allowLocal=false)
> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
> <console>:27)
> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[7]
> at saveAsTextFile at <console>:27), which has no missing parents
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
> curMem=184615, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 134.1 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
> curMem=321951, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
> in memory (estimated size 46.6 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
> broadcast_3_piece0
> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:839
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3
> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 312 ms on nodo2.i3a.info (1/2)
> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 313 ms on nodo3.i3a.info (2/2)
> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
> all completed, from pool
> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
> <console>:27) finished in 0.334 s
> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
> <console>:27, took 0.436388 s
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Jacinto Arias
Yes, printing the result with collect or take works.

This is a minimal example, but the same happens when working with real data: the 
actions are performed and the resulting RDDs can be printed out without 
problem. The data is there and the operations are correct; they just cannot be 
written to a file.



Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander
Hi Jacinto,

If I were you, the first thing I would do is write a sample Java application
that writes data into HDFS and see if it works. Metadata is being created in
HDFS, which means communication with the namenode is working, but not with the
datanodes, since you don't see any data inside the file. Why don't you look at
the HDFS logs and see what's happening when your application talks to the
namenode? I suspect a networking issue; also check whether the datanodes are
running fine.

Thank you,
Ajay
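Before writing a full test application, a quick first check along these lines is basic TCP reachability from each node to the namenode and datanode ports. A sketch with hypothetical hosts and ports (8020 is a common namenode IPC port and 50010 a common datanode port, but the actual values depend on the distribution; this probes the network only, not HDFS health):

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
def canConnect(host: String, port: Int, timeoutMs: Int): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}

// Hypothetical cluster addresses -- replace with the real namenode/datanodes:
// println(canConnect("namenode.example.com", 8020, 2000))
// println(canConnect("datanode1.example.com", 50010, 2000))
```

If the namenode port is reachable but the datanode ports are not, that would match the symptom above: metadata gets created, but no block data arrives.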


Re: saveAsTextFile() part- files are missing

2015-05-21 Thread Tomasz Fruboes

Hi,

 it looks like you are writing to a local filesystem. Could you try writing 
to a location visible to all nodes (master and workers), e.g. an NFS share?


 HTH,
  Tomasz

W dniu 21.05.2015 o 17:16, rroxanaioana pisze:

Hello!
I just started with Spark. I have an application which counts words in a
file (1 MB file).
The file is stored locally. I loaded the file using native code and then
created the RDD from it.

 JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);
 JavaRDD<String> words = rddFromFile.flatMap(...);
 JavaPairRDD<String, Integer> pairs = words.mapToPair(...);
 JavaPairRDD<String, Integer> counter = pairs.reduceByKey(...);

 counter.saveAsTextFile("file:///root/output");
 context.close();

I have one master and 2 slaves. I run the program from the master node.
The output directory is created on the master node and on the 2 worker nodes. On
the master node I have only one file, _SUCCESS (empty), and on the worker nodes I have
a _temporary directory. I printed the counter at the console, and the result seems ok.
What am I doing wrong?
Thank you!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-part-files-are-missing-tp22974.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a spark context and
creating a new one. Often, I was unable to then write to HDFS due to this
error. I subsequently switched to a different implementation where instead
of tearing down and re initializing the spark context I'd instead submit a
separate request to YARN.
On Fri, May 15, 2015 at 2:35 PM Puneet Kapoor puneet.cse.i...@gmail.com
wrote:

 I am seeing this on hadoop 2.4.0 version.

 Thanks for your suggestions, i will try those and let you know if they
 help !

 On Sat, May 16, 2015 at 1:57 AM, Steve Loughran ste...@hortonworks.com
 wrote:

  What version of Hadoop are you seeing this on?


  On 15 May 2015, at 20:03, Puneet Kapoor puneet.cse.i...@gmail.com
 wrote:

  Hey,

  Did you find any solution for this issue, we are seeing similar logs in
 our Data node logs. Appreciate any help.





  2015-05-15 10:51:43,615 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode:
 NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
  src: /192.168.112.190:46253 dst: /192.168.151.104:50010
 java.net.SocketTimeoutException: 6 millis timeout while waiting for
 channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
 remote=/192.168.112.190:46253]
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
 at java.io.BufferedInputStream.fill(Unknown Source)
 at java.io.BufferedInputStream.read1(Unknown Source)
 at java.io.BufferedInputStream.read(Unknown Source)
 at java.io.DataInputStream.read(Unknown Source)
 at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
 at
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
 at
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Unknown Source)


  That's being logged @ error level in DN. It doesn't mean the DN has
 crashed, only that it timed out waiting for data: something has gone wrong
 elsewhere.

  https://issues.apache.org/jira/browse/HDFS-693


 there's a couple of properties you can set to extend timeouts

  <property>
    <name>dfs.socket.timeout</name>
    <value>20000</value>
  </property>

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>20000</value>
  </property>



 You can also increase the number of datanode transceiver threads to
 handle data IO across the network

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

 Yes, that property has that explicit spelling; it's easy to get wrong





Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
Hey,

Did you find any solution for this issue, we are seeing similar logs in our
Data node logs. Appreciate any help.


2015-05-15 10:51:43,615 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
 src: /192.168.112.190:46253 dst: /192.168.151.104:50010
java.net.SocketTimeoutException: 6 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
remote=/192.168.112.190:46253]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.DataInputStream.read(Unknown Source)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Unknown Source)

Thanks
Puneet

On Wed, Dec 3, 2014 at 2:50 AM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 Hi all, as the last stage of execution, I am writing out a dataset to disk. 
 Before I do this, I force the DAG to resolve so this is the only job left in 
 the pipeline. The dataset in question is not especially large (a few 
 gigabytes). During this step, however, HDFS will inevitably crash. I will lose 
 connection to data-nodes and get stuck in the loop of death – failure causes 
 job restart, eventually causing the overall job to fail. On the data node 
 logs I see the errors below. Does anyone have any ideas as to what is going 
 on here? Thanks!


 java.io.IOException: Premature EOF from inputStream
   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
   at 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:455)
   at 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:741)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:718)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
   at java.lang.Thread.run(Thread.java:745)




 innovationdatanode03.cof.ds.capitalone.com:1004:DataXceiver error processing 
 WRITE_BLOCK operation  src: /10.37.248.60:44676 dst: /10.37.248.59:1004
 java.net.SocketTimeoutException: 65000 millis timeout while waiting for 
 channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
 local=/10.37.248.59:43692 remote=/10.37.248.63:1004]
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
   at 
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
   at 
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
   at 
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
   at java.io.FilterInputStream.read(FilterInputStream.java:83)
   at java.io.FilterInputStream.read(FilterInputStream.java:83)
   at 
 org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2101)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:660)
   at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
   at 
 

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
I am seeing this on hadoop 2.4.0 version.

Thanks for your suggestions, i will try those and let you know if they help
!

On Sat, May 16, 2015 at 1:57 AM, Steve Loughran ste...@hortonworks.com
wrote:

  What version of Hadoop are you seeing this on?


  On 15 May 2015, at 20:03, Puneet Kapoor puneet.cse.i...@gmail.com
 wrote:

  Hey,

  Did you find any solution for this issue, we are seeing similar logs in
 our Data node logs. Appreciate any help.





  2015-05-15 10:51:43,615 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode:
 NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
  src: /192.168.112.190:46253 dst: /192.168.151.104:50010
 java.net.SocketTimeoutException: 6 millis timeout while waiting for
 channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
 remote=/192.168.112.190:46253]
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
 at java.io.BufferedInputStream.fill(Unknown Source)
 at java.io.BufferedInputStream.read1(Unknown Source)
 at java.io.BufferedInputStream.read(Unknown Source)
 at java.io.DataInputStream.read(Unknown Source)
 at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
 at
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
 at
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Unknown Source)


  That's being logged @ error level in DN. It doesn't mean the DN has
 crashed, only that it timed out waiting for data: something has gone wrong
 elsewhere.

  https://issues.apache.org/jira/browse/HDFS-693


 there's a couple of properties you can set to extend timeouts

  <property>
    <name>dfs.socket.timeout</name>
    <value>20000</value>
  </property>

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>20000</value>
  </property>



 You can also increase the number of datanode transceiver threads to
 handle data IO across the network

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

 Yes, that property has that explicit spelling; it's easy to get wrong




Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Another thing - could it be a permission problem?
It creates all the directory structure (in red)/tmp/wordcount/
_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
so I am guessing not.

On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty njmu...@gmail.com wrote:

 You are most probably right. I assumed others may have run into this.
 When I try to put the files in there, it creates a directory structure
 with the part-0 and part1 files but these files are of size 0 - no
 content. The client error and the server logs show the error message
 - which seems to indicate that the system is aware that a datanode exists
 but is excluded from the operation. So, it looks like it is not partitioned
 and Ambari indicates that HDFS is in good health with one NN, one SN, one
 DN.
 I am unable to figure out what the issue is.
 thanks for your help.

 On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote:

 What happens when you try to put files to your hdfs from the local
 filesystem? Looks like it's an HDFS issue rather than a Spark thing.
 On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:

 I have searched all replies to this question and not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) 
 / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  
 There are 1 datanode(s) running and 1 node(s) are excluded in this 
 operation.*
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1, same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node, and the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan







Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
You are most probably right. I assumed others may have run into this.
When I try to put the files in there, it creates a directory structure with
the part-0 and part1 files but these files are of size 0 - no
content. The client error and the server logs show the error message
- which seems to indicate that the system is aware that a datanode exists
but is excluded from the operation. So, it looks like it is not partitioned
and Ambari indicates that HDFS is in good health with one NN, one SN, one
DN.
I am unable to figure out what the issue is.
thanks for your help.

On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote:

 What happens when you try to put files to your hdfs from the local filesystem?
 Looks like it's an HDFS issue rather than a Spark thing.
 On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:

 I have searched all replies to this question and not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) 
 / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
 are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
  at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
  at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
  at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1, same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node, and the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan






Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files to your hdfs from the local filesystem?
Looks like it's an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:


 I have searched all replies to this question and not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) / 
 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
 are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1, same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node, and the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan





Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Sean. I want to load each batch into Redshift. What's the best/most 
efficient way to do that?

Vadim


 On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:
 
 You can't, since that's how it's designed to work. Batches are saved
 in different files, which are really directories containing
 partitions, as is common in Hadoop. You can move them later, or just
 read them where they are.
 
 On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
 vadim.bichuts...@gmail.com wrote:
 I am using Spark Streaming where during each micro-batch I output data to S3
 using
 saveAsTextFile. Right now each batch of data is put into its own directory
 containing
 2 objects, _SUCCESS and part-0.
 
 How do I output each batch into a common directory?
 
 Thanks,
 Vadim




RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
The reason for this is as follows:

 

1.   You are saving data on HDFS

2.   HDFS as a cluster/server side Service has a Single Writer / Multiple 
Reader multithreading model 

3.   Hence each thread of execution in Spark has to write to a separate 
file in HDFS

4.   Moreover the RDDs are partitioned across cluster nodes and operated 
upon by multiple threads there and on top of that in Spark Streaming you have 
many micro-batch RDDs streaming in all the time as part of a DStream  

 

If you want fine / detailed management of the writing to HDFS you can implement 
your own HDFS adapter and invoke it in foreachRDD and foreach 
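To make points 2 and 3 concrete: because each task is a separate writer, saveAsTextFile emits one part file per partition plus a _SUCCESS marker. Here is a toy sketch in plain Python (no Spark involved; the partition data and output layout are illustrative):

```python
import os
import tempfile

def save_as_text_file(partitions, out_dir):
    """Mimic saveAsTextFile's layout: one writer per partition, one part file each."""
    os.makedirs(out_dir)
    for i, part in enumerate(partitions):
        # Each Spark task is a single writer to its own file, matching
        # HDFS's single-writer / multiple-reader model.
        with open(os.path.join(out_dir, "part-%05d" % i), "w") as f:
            for record in part:
                f.write(str(record) + "\n")
    # The empty _SUCCESS marker is written once the job commits.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()

out = os.path.join(tempfile.mkdtemp(), "wordcount")
save_as_text_file([[("a", 2), ("b", 1)], [("c", 3)]], out)
print(sorted(os.listdir(out)))  # ['_SUCCESS', 'part-00000', 'part-00001']
```

This is why the number of part files tracks the number of partitions, and why each micro-batch in Spark Streaming produces its own directory of them.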

 

Regards

Evo Eftimov  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile

 

I am using Spark Streaming where during each micro-batch I output data to S3 
using

saveAsTextFile. Right now each batch of data is put into its own directory 
containing

2 objects, _SUCCESS and part-0.

 

How do I output each batch into a common directory?

 

Thanks,

Vadim

  



RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier 

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, April 16, 2015 6:35 PM
To: Vadim Bichutskiy
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile

You can't, since that's how it's designed to work. Batches are saved in 
different files, which are really directories containing partitions, as is 
common in Hadoop. You can move them later, or just read them where they are.

On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com 
wrote:
 I am using Spark Streaming where during each micro-batch I output data 
 to S3 using saveAsTextFile. Right now each batch of data is put into 
 its own directory containing
 2 objects, _SUCCESS and part-0.

 How do I output each batch into a common directory?

 Thanks,
 Vadim




RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them 
wherever you want - use foreachPartition and then foreach 

-Original Message-
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:39 PM
To: Sean Owen
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile

Thanks Sean. I want to load each batch into Redshift. What's the best/most 
efficient way to do that?

Vadim


 On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:
 
 You can't, since that's how it's designed to work. Batches are saved 
 in different files, which are really directories containing 
 partitions, as is common in Hadoop. You can move them later, or just 
 read them where they are.
 
 On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:
 I am using Spark Streaming where during each micro-batch I output 
 data to S3 using saveAsTextFile. Right now each batch of data is put 
 into its own directory containing
 2 objects, _SUCCESS and part-0.
 
 How do I output each batch into a common directory?
 
 Thanks,
 Vadim




Re: saveAsTextFile

2015-04-16 Thread Sean Owen
Just copy the files? it shouldn't matter that much where they are as
you can find them easily. Or consider somehow sending the batches of
data straight into Redshift? no idea how that is done but I imagine
it's doable.

On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
 Thanks Sean. I want to load each batch into Redshift. What's the best/most 
 efficient way to do that?

 Vadim


 On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:

 You can't, since that's how it's designed to work. Batches are saved
 in different files, which are really directories containing
 partitions, as is common in Hadoop. You can move them later, or just
 read them where they are.

 On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
 vadim.bichuts...@gmail.com wrote:
 I am using Spark Streaming where during each micro-batch I output data to S3
 using
 saveAsTextFile. Right now each batch of data is put into its own directory
 containing
 2 objects, _SUCCESS and part-0.

 How do I output each batch into a common directory?

 Thanks,
 Vadim




RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Also, to juggle even further the multithreading model of both Spark and HDFS, you 
can even publish the data from Spark first to a message broker, e.g. Kafka, from 
where a predetermined number (from 1 to infinity) of parallel consumers will 
retrieve it and store it in HDFS in one or more finely controlled files and 
directories  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:45 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile

 

Thanks Evo for your detailed explanation.


On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti...@isecc.com wrote:

The reason for this is as follows:

 

1.  You are saving data on HDFS

2.  HDFS as a cluster/server side Service has a Single Writer / Multiple 
Reader multithreading model 

3.  Hence each thread of execution in Spark has to write to a separate file 
in HDFS

4.  Moreover the RDDs are partitioned across cluster nodes and operated 
upon by multiple threads there and on top of that in Spark Streaming you have 
many micro-batch RDDs streaming in all the time as part of a DStream  

 

If you want fine / detailed management of the writing to HDFS you can implement 
your own HDFS adapter and invoke it in foreachRDD and foreach 

 

Regards

Evo Eftimov  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile

 

I am using Spark Streaming where during each micro-batch I output data to S3 
using

saveAsTextFile. Right now each batch of data is put into its own directory 
containing

2 objects, _SUCCESS and part-0.

 

How do I output each batch into a common directory?

 

Thanks,

Vadim

  
 



Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Copy should be doable, but I'm not sure how to specify a prefix for the 
directory while keeping the filename (i.e. part-0) fixed in the copy command.
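One way to get a common directory is to copy each batch's part files into it, folding the batch directory name into the file name as a prefix, so the fixed part-NNNNN names never collide. A minimal sketch in plain Python over local directories (the batch names here are made up; for S3 or HDFS you would do the same renaming with the respective copy tools):

```python
import os
import shutil
import tempfile

def merge_batches(batch_dirs, common_dir):
    """Copy part files from per-batch output dirs into one common dir,
    using the batch directory name as a filename prefix to avoid collisions."""
    os.makedirs(common_dir, exist_ok=True)
    for batch_dir in batch_dirs:
        prefix = os.path.basename(batch_dir.rstrip("/"))
        for name in os.listdir(batch_dir):
            if not name.startswith("part-"):
                continue  # skip _SUCCESS and other markers
            shutil.copy(os.path.join(batch_dir, name),
                        os.path.join(common_dir, prefix + "-" + name))

# Fake two micro-batch output directories, then merge them.
root = tempfile.mkdtemp()
batches = ["batch-1429200000", "batch-1429200010"]
for batch in batches:
    d = os.path.join(root, batch)
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write(batch + "\n")
    open(os.path.join(d, "_SUCCESS"), "w").close()

merged = os.path.join(root, "common")
merge_batches([os.path.join(root, b) for b in batches], merged)
print(sorted(os.listdir(merged)))
# ['batch-1429200000-part-00000', 'batch-1429200010-part-00000']
```

The prefix carries the batch identity, so the copy never has to rename part-00000 itself.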



 On Apr 16, 2015, at 1:51 PM, Sean Owen so...@cloudera.com wrote:
 
 Just copy the files? it shouldn't matter that much where they are as
 you can find them easily. Or consider somehow sending the batches of
 data straight into Redshift? no idea how that is done but I imagine
 it's doable.
 
 On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
 vadim.bichuts...@gmail.com wrote:
 Thanks Sean. I want to load each batch into Redshift. What's the best/most 
 efficient way to do that?
 
 Vadim
 
 
 On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:
 
 You can't, since that's how it's designed to work. Batches are saved
 in different files, which are really directories containing
 partitions, as is common in Hadoop. You can move them later, or just
 read them where they are.
 
 On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
 vadim.bichuts...@gmail.com wrote:
 I am using Spark Streaming where during each micro-batch I output data to 
 S3
 using
 saveAsTextFile. Right now each batch of data is put into its own directory
 containing
 2 objects, _SUCCESS and part-0.
 
 How do I output each batch into a common directory?
 
 Thanks,
 Vadim

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved
in different files, which are really directories containing
partitions, as is common in Hadoop. You can move them later, or just
read them where they are.

On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
 I am using Spark Streaming where during each micro-batch I output data to S3
 using
 saveAsTextFile. Right now each batch of data is put into its own directory
 containing
 2 objects, _SUCCESS and part-0.

 How do I output each batch into a common directory?

 Thanks,
 Vadim

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Evo for your detailed explanation.

 On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti...@isecc.com wrote:
 
 The reason for this is as follows:
  
 1.   You are saving data on HDFS
 2.   HDFS as a cluster/server side Service has a Single Writer / Multiple 
 Reader multithreading model
 3.   Hence each thread of execution in Spark has to write to a separate 
 file in HDFS
 4.   Moreover the RDDs are partitioned across cluster nodes and operated 
 upon by multiple threads there and on top of that in Spark Streaming you have 
 many micro-batch RDDs streaming in all the time as part of a DStream  
  
 If you want fine / detailed management of the writing to HDFS you can 
 implement your own HDFS adapter and invoke it in foreachRDD and foreach
  
 Regards
 Evo Eftimov  
  
 From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
 Sent: Thursday, April 16, 2015 6:33 PM
 To: user@spark.apache.org
 Subject: saveAsTextFile
  
 I am using Spark Streaming where during each micro-batch I output data to S3 
 using
 saveAsTextFile. Right now each batch of data is put into its own directory 
 containing
 2 objects, _SUCCESS and part-0.
  
 How do I output each batch into a common directory?
  
 Thanks,
 Vadim


Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed?  Could it be that there are a few keys with a huge
number of records?  You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


you could do this with:

input = sc.textFile
pairCounts = input.map{x => (x, 1)}.reduceByKey{_ + _}
sorted = pairCounts.sortByKey
sorted.saveAsTextFile
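
The effect of that map/reduceByKey/sortByKey pipeline can be sanity-checked outside Spark. A minimal plain-Python sketch (the records are made up; `Counter` plays the role of `reduceByKey`, `sorted()` the role of `sortByKey`):

```python
from collections import Counter

# stand-in for the skewed input: one key with many repeated records
records = ["recordA", "recordA", "recordA", "recordB"]

# Counter does the map{x => (x, 1)}.reduceByKey(_ + _) step,
# sorted() does the sortByKey step
counts = sorted(Counter(records).items())
# counts == [("recordA", 3), ("recordB", 1)]
```

With heavy skew this shrinks the output from one line per record to one line per distinct key, which is where the time goes at the end of the save.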


On Mon, Mar 9, 2015 at 12:31 PM, mingweili0x m...@spokeo.com wrote:

 I'm basically running a sorting using spark. The spark program will read
 from
 HDFS, sort on composite keys, and then save the partitioned result back to
 HDFS.
 pseudo code is like this:

 input = sc.textFile
 pairs = input.mapToPair
 sorted = pairs.sortByKey
 values = sorted.values
 values.saveAsTextFile

  Input size is ~ 160G, and I made 1000 partitions specified in
 JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
 split into two stages: saveAsTextFile and mapToPair. MapToPair finished
 in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress,
 and the last few tasks just took forever and never finished.

 Cluster setup:
 8 nodes
 on each node: 15gb memory, 8 cores

 running parameters:
 --executor-memory 12G
 --conf spark.cores.max=60

 Thank you for any help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Akhil Das
Don't you think 1000 partitions is too few for 160GB of data? Also, you could try
using KryoSerializer and enabling RDD compression.

Thanks
Best Regards

On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x m...@spokeo.com wrote:

 I'm basically running a sorting using spark. The spark program will read
 from
 HDFS, sort on composite keys, and then save the partitioned result back to
 HDFS.
 pseudo code is like this:

 input = sc.textFile
 pairs = input.mapToPair
 sorted = pairs.sortByKey
 values = sorted.values
 values.saveAsTextFile

  Input size is ~ 160G, and I made 1000 partitions specified in
 JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
 split into two stages: saveAsTextFile and mapToPair. MapToPair finished
 in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress,
 and the last few tasks just took forever and never finished.

 Cluster setup:
 8 nodes
 on each node: 15gb memory, 8 cores

 running parameters:
 --executor-memory 12G
 --conf spark.cores.max=60

 Thank you for any help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Sean Owen
This is more of an aside, but why repartition this data instead of letting
it define partitions naturally? You will end up with a similar number.
On Mar 9, 2015 5:32 PM, mingweili0x m...@spokeo.com wrote:

 I'm basically running a sorting using spark. The spark program will read
 from
 HDFS, sort on composite keys, and then save the partitioned result back to
 HDFS.
 pseudo code is like this:

 input = sc.textFile
 pairs = input.mapToPair
 sorted = pairs.sortByKey
 values = sorted.values
 values.saveAsTextFile

  Input size is ~ 160G, and I made 1000 partitions specified in
 JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
 split into two stages: saveAsTextFile and mapToPair. MapToPair finished
 in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress,
 and the last few tasks just took forever and never finished.

 Cluster setup:
 8 nodes
 on each node: 15gb memory, 8 cores

 running parameters:
 --executor-memory 12G
 --conf spark.cores.max=60

 Thank you for any help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do

rdd.map(_.mkString("\t"))

or with some other delimiter to make it `RDD[String]`, and then call
`saveAsTextFile`.
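
For reference, the stringify-and-join step looks like this outside Spark, in a plain-Python sketch with made-up data:

```python
# a row of mixed types, as you'd have in an element of RDD[Array[Any]]
row = [1, "abc", 3.5]

# the Python equivalent of _.mkString("\t"): stringify each element
# and join with the delimiter
line = "\t".join(str(x) for x in row)
# line == "1\tabc\t3.5"
```

Each such line is what ends up as one record in the saved text file.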




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
Your output folder specifies

rdd.saveAsTextFile(s3n://nexgen-software/dev/output);

So it will try to write to /dev/output which is as expected. If you create
the directory /dev/output upfront in your bucket, and try to save it to
that (empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz wrote:

  Does anyone know if I can save a RDD as a text file to a pre-created
 directory in S3 bucket?

  I have a directory created in S3 bucket: //nexgen-software/dev

  When I tried to save a RDD as text file in this directory:
 rdd.saveAsTextFile(s3n://nexgen-software/dev/output);


  I got following exception at runtime:

 Exception in thread main org.apache.hadoop.fs.s3.S3Exception:
 org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
 ResponseCode=403, ResponseMessage=Forbidden


  I have verified /dev has write permission. However, if I grant the
 bucket //nexgen-software write permission, I don't get the exception. But the
 output is not created under dev. Rather, a different /dev/output directory
 is created in the bucket (//nexgen-software). Is this how
 saveAsTextFile behaves in S3? Is there any way I can have output created
 under a pre-defined directory?


  Thanks in advance.







Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
When Spark saves an RDD to a text file, the directory must not exist upfront. It 
will create the directory and write the data to part- files under it. 
In my use case, I create a directory dev in the bucket ://nexgen-software/dev . 
I expect it to create output directly under dev and part- files under output. But 
it gave me an exception, as I only give write permission to dev, not the bucket. If 
I open up write permission to the bucket, it works. But it does not create the output 
directory under dev; rather it creates another dev/output directory under the 
bucket. I just want to know if it is possible to have the output directory created 
under the dev directory I created upfront.

From: Nick Pentreath nick.pentre...@gmail.commailto:nick.pentre...@gmail.com
Date: Monday, January 26, 2015 9:15 PM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: SaveAsTextFile to S3 bucket

Your output folder specifies

rdd.saveAsTextFile(s3n://nexgen-software/dev/output);

So it will try to write to /dev/output which is as expected. If you create the 
directory /dev/output upfront in your bucket, and try to save it to that 
(empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin 
kevin.c...@neustar.bizmailto:kevin.c...@neustar.biz wrote:
Does anyone know if I can save a RDD as a text file to a pre-created directory 
in S3 bucket?

I have a directory created in S3 bucket: //nexgen-software/dev

When I tried to save a RDD as text file in this directory:
rdd.saveAsTextFile(s3n://nexgen-software/dev/output);


I got following exception at runtime:

Exception in thread main org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - 
ResponseCode=403, ResponseMessage=Forbidden


I have verified /dev has write permission. However, if I grant the bucket 
//nexgen-software write permission, I don't get the exception. But the output is 
not created under dev. Rather, a different /dev/output directory is created 
in the bucket (//nexgen-software). Is this how saveAsTextFile 
behaves in S3? Is there any way I can have output created under a pre-defined 
directory?


Thanks in advance.






Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the
argument for saveAsTextFile. This argument is considered as a folder in the
bucket and actual files are created in it with the naming convention
part-n, where n is the output partition number.

On Mon, Jan 26, 2015 at 9:15 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Your output folder specifies

 rdd.saveAsTextFile(s3n://nexgen-software/dev/output);

 So it will try to write to /dev/output which is as expected. If you create
 the directory /dev/output upfront in your bucket, and try to save it to
 that (empty) directory, what is the behaviour?

 On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz
 wrote:

  Does anyone know if I can save a RDD as a text file to a pre-created
 directory in S3 bucket?

  I have a directory created in S3 bucket: //nexgen-software/dev

  When I tried to save a RDD as text file in this directory:
 rdd.saveAsTextFile(s3n://nexgen-software/dev/output);


  I got following exception at runtime:

 Exception in thread main org.apache.hadoop.fs.s3.S3Exception:
 org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
 ResponseCode=403, ResponseMessage=Forbidden


  I have verified /dev has write permission. However, if I grant the
 bucket //nexgen-software write permission, I don't get the exception. But the
 output is not created under dev. Rather, a different /dev/output directory
 is created in the bucket (//nexgen-software). Is this how
 saveAsTextFile behaves in S3? Is there any way I can have output created
 under a pre-defined directory?


  Thanks in advance.








Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially,
saveAsTextFile calls toString() on the elements of the RDD, so a call to
null.toString will result in an NPE.
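
The usual fix is to drop the nulls before saving (in Spark, something like `rdd.filter(_ != null)` ahead of `saveAsTextFile`). The idea in a plain-Python sketch with made-up data:

```python
# records as they might come off a parse step, with a failed parse as None
records = ["a", None, "b"]

# drop the nulls before formatting, mirroring rdd.filter(_ != null)
# ahead of saveAsTextFile
lines = [str(r) for r in records if r is not None]
# lines == ["a", "b"]
```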



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p21178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-01-15 Thread Prannoy
Hi,

Before saving the rdd, do a collect on the rdd and print its contents.
Probably it contains a null value.

Thanks.

On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List] 
ml-node+s1001560n20953...@n3.nabble.com wrote:

 If you can paste the code here I can certainly help.

 Also confirm the version of spark you are using

 Regards
 Pankaj
 Infoshore Software
 India






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p21160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override
the saveAsTextFile in SchemaRDD (or make Row.toString comma separated).

Do you mind filing a JIRA ticket and copy me?


On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton bur...@spinn3r.com wrote:

 This is almost funny.

 I want to dump a computation to the filesystem.  It’s just the result of a
 Spark SQL call reading the data from Cassandra.

 The problem is that it looks like it’s just calling toString() which is
 useless.

 The example is below.

 I assume this is just a (bad) bug.

 org.apache.spark.sql.api.java.Row@37f108

 org.apache.spark.sql.api.java.Row@d0426773

 org.apache.spark.sql.api.java.Row@38c9d3


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here I can certainly help.

Also confirm the version of spark you are using

Regards
Pankaj 
Infoshore Software 
India



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-01-03 Thread Sanjay Subramanian

@laila Based on the error you mentioned in the nabble link below, it seems like 
there are no permissions to write to HDFS, so this is possibly why 
saveAsTextFile is failing.

  From: Pankaj Narang pankajnaran...@gmail.com
 To: user@spark.apache.org 
 Sent: Saturday, January 3, 2015 4:07 AM
 Subject: Re: saveAsTextFile
   
If you can paste the code here I can certainly help.

Also confirm the version of spark you are using

Regards
Pankaj 
Infoshore Software 
India



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



   

Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, 

Have you tried running it while keeping the wordCounts.print()? Possibly the
import of the package *org.apache.spark.streaming._* is missing, so during
sbt package it is unable to locate the saveAsTextFile API. 

Go to
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
to check if all the needed packages are there. 

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-error-tp18960p19006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
Hi Niko,

It looks like you are calling a method on DStream, which does not exist.

Check out:
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams

for the method saveAsTextFiles

Harold

On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin niko.gamu...@gmail.com
wrote:

 Hi,

 I tried to modify NetworkWordCount example in order to save the output to
 a file.

 In NetworkWordCount.scala I replaced the line

 wordCounts.print()
 with
 wordCounts.saveAsTextFile(/home/bart/rest_services/output.txt)

 When I ran sbt/sbt package it returned the following error:

 [error]
 /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57:
 value saveAsTextFile is not a member of
 org.apache.spark.streaming.dstream.DStream[(String, Int)]
 [error]
 wordCounts.saveAsTextFile(/home/bart/rest_services/output.txt)
 [error]^
 [error] one error found
 [error] (examples/compile:compile) Compilation failed
 [error] Total time: 5 s, completed Nov 14, 2014 9:38:53 PM

 Does anyone know why this error occurs?
 Is there any other way to save the results to a text file?

 Regards,

 Niko







Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update. I'm still getting the same issue. I ended up doing a coalesce
instead of a cache to get around the memory issue but saveAsTextFile still
won't proceed without the coalesce or cache first.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-makes-no-progress-without-caching-RDD-tp12613p13283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, 

Your job is probably hanging on the groupByKey process. Probably GC is kicking 
in and the process starts to hang or the data is unbalanced and you end up with 
stragglers (Once GC kicks in you'll start to get the connection errors you 
shared). If you don't care about the list of values itself, but the count of it 
(that appears to be what you're trying to save, correct me if I'm wrong), then 
I would suggest using `countByKey()` directly on 
`JavaPairRDD<String, AnalyticsLogFlyweight> partitioned`. 

Best, 
Burak 
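
The difference matters because groupByKey buffers every value for a key before the count is taken, while counting directly keeps only one integer per key. A plain-Python sketch of the counting shape (the pairs are made up):

```python
from collections import defaultdict

# (trackingId, log) pairs; the values themselves are never materialized
# into per-key lists, only a running count is kept per key
pairs = [("t1", "log1"), ("t2", "log2"), ("t1", "log3")]
counts = defaultdict(int)
for key, _ in pairs:
    counts[key] += 1
# dict(counts) == {"t1": 2, "t2": 1}
```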

- Original Message -

From: David david.b...@gmail.com 
To: user u...@spark.incubator.apache.org 
Sent: Tuesday, August 19, 2014 1:44:18 PM 
Subject: saveAsTextFile hangs with hdfs 

I have a simple spark job that seems to hang when saving to hdfs. When looking 
at the spark web ui, the job reached 97 of 100 tasks completed. I need some 
help determining why the job appears to hang. The job hangs on the 
saveAsTextFile() call. 


https://www.dropbox.com/s/fdp7ck91hhm9w68/Screenshot%202014-08-19%2010.53.24.png
 

The job is pretty simple: 

JavaRDD<String> analyticsLogs = context 
    .textFile(Joiner.on(",").join(hdfs.glob("/spark-dfs", ".*\\.log$")), 4); 

JavaRDD<AnalyticsLogFlyweight> flyweights = analyticsLogs 
    .map(line -> { 
        try { 
            AnalyticsLog log = GSON.fromJson(line, AnalyticsLog.class); 
            AnalyticsLogFlyweight flyweight = new AnalyticsLogFlyweight(); 
            flyweight.ipAddress = log.getIpAddress(); 
            flyweight.time = log.getTime(); 
            flyweight.trackingId = log.getTrackingId(); 
            return flyweight; 
        } catch (Exception e) { 
            LOG.error("error parsing json", e); 
            return null; 
        } 
    }); 

JavaRDD<AnalyticsLogFlyweight> filtered = flyweights 
    .filter(log -> log != null); 

JavaPairRDD<String, AnalyticsLogFlyweight> partitioned = filtered 
    .mapToPair((AnalyticsLogFlyweight log) -> new Tuple2<>(log.trackingId, log)) 
    .partitionBy(new HashPartitioner(100)).cache(); 

Ordering<AnalyticsLogFlyweight> ordering = 
    Ordering.natural().nullsFirst().onResultOf(new Function<AnalyticsLogFlyweight, Long>() { 
        public Long apply(AnalyticsLogFlyweight log) { 
            return log.time; 
        } 
    }); 

JavaPairRDD<String, Iterable<AnalyticsLogFlyweight>> stringIterableJavaPairRDD = partitioned.groupByKey(); 
JavaPairRDD<String, Integer> stringIntegerJavaPairRDD = stringIterableJavaPairRDD.mapToPair((log) -> { 
    List<AnalyticsLogFlyweight> sorted = Lists.newArrayList(log._2()); 
    sorted.forEach(l -> LOG.info("sorted {}", l)); 
    return new Tuple2<>(log._1(), sorted.size()); 
}); 

String outputPath = "/summarized/groupedByTrackingId4"; 
hdfs.rm(outputPath, true); 
stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath)); 


Thanks in advance, David 



Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
update: hangs even when not writing to hdfs.  I changed the code to avoid
saveAsTextFile() and instead do a foreachPartition and log the results. 
This time it reached 96/100 tasks, but still hangs.



I changed the saveAsTextFile to:

 stringIntegerJavaPairRDD.foreachPartition(p -> {
     while (p.hasNext()) {
         LOG.info("{}", p.next());
     }
 });

Thanks, David.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-hangs-with-hdfs-tp12412p12419.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
Not sure if this is helpful or not, but in one executor stderr log, I found
this:

14/08/19 20:17:04 INFO CacheManager: Partition rdd_5_14 not found, computing
it
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Getting 16251 non-empty blocks out of 25435 blocks
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Started 3 remote fetches in 123 ms
14/08/19 20:34:00 INFO SendingConnection: Initiating connection to
[localhost/127.0.0.1:39840]
14/08/19 20:34:00 WARN SendingConnection: Error finishing connection to
localhost/127.0.0.1:39840
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/08/19 20:34:00 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(localhost,39840)
14/08/19 20:34:00 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(localhost,39840)
14/08/19 20:34:08 INFO SendingConnection: Initiating connection to
[localhost/127.0.0.1:39840]
14/08/19 20:34:08 WARN SendingConnection: Error finishing connection to
localhost/127.0.0.1:39840
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/08/19 20:34:08 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(localhost,39840)
14/08/19 20:34:08 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(localhost,39840)





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-hangs-with-hdfs-tp12412p12420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2014-08-10 Thread durin
This should work:
jobs.saveAsTextFile("file:////home/hysom/testing") 

Note the 4 slashes: it's really 3 slashes + the absolute path.

This should be mentioned in the docs though; I only remember it from
having seen it somewhere else.
The output folder, here "testing", will be created and must therefore not
exist beforehand.


Best regards, Simon



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp11803p11846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org