Re: saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-28 Thread Ascot Moss
Hi,

Thanks for your reply.

Permissions (access) are not the issue in my case: the problem only happens
when the bigger input file is used to generate the model, i.e. with smaller
input(s) everything works well. It seems to me that ".save" cannot save a
big file.

Q1: Any idea about the size limit that ".save" can handle?
Q2: Any idea how to check the size of the model that will be saved via
".save"?

Regards



On Thu, Jul 28, 2016 at 4:19 PM, Spico Florin  wrote:

> Hi!
> There are many reasons why your task may have failed. One could be that you
> don't have the proper permissions (access) to HDFS with your user. Please
> check your user's rights to write to HDFS. Please also have a look at:
>
> http://stackoverflow.com/questions/27427042/spark-unable-to-save-in-hadoop-permission-denied-for-user
> I hope it helps.
>  Florin
>
>
> On Thu, Jul 28, 2016 at 3:49 AM, Ascot Moss  wrote:
>
>>
>> Hi,
>>
>> Please help!
>>
>> When saving the model, I got the following error and cannot save the model
>> to hdfs:
>>
>> (my source code, my spark is v1.6.2)
>> my_model.save(sc, "/my_model")
>>
>> -
>> 16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose
>> tasks have all completed, from pool
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
>> treeEnsembleModels.scala:447) finished in 0.901 s
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
>> treeEnsembleModels.scala:447, took 2.513396 s
>>
>> Killed
>> -
>>
>>
>> Q1: Is there any limitation on saveAsTextFile?
>> Q2: Or where can I find the error log file location?
>>
>> Regards
>>
>>
>>
>>
>>
>


RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
If the data is not too big, one option is to call the collect method and then 
save the result to a local file using standard Java/Scala API. However, keep in 
mind that this will transfer data from all the worker nodes to the driver 
program. Looks like that is what you want to do anyway, but you need to be 
aware of how big that data is and related implications.
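
A minimal sketch of that approach, assuming an RDD[String] named rdd and an
illustrative output path:

import java.io.PrintWriter

// Pull all partitions back to the driver (only safe when the data fits in
// driver memory), then write them with the standard Scala/Java file API.
val lines: Array[String] = rdd.collect()
val out = new PrintWriter("/tmp/output.txt")  // local path on the driver
try lines.foreach(out.println) finally out.close()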

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Monday, February 1, 2016 6:00 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your response. Data is available on the worker nodes, but I was
looking for a way to write directly to the local fs. Seems like that is not an option.

Thanks,
Sivakumar Bhavanari.

On Mon, Feb 1, 2016 at 5:45 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
You should not be saving an RDD to local FS if Spark is running on a real 
cluster. Essentially, each Spark worker will save the partitions that it 
processes locally.

Check the directories on the worker nodes and you should find pieces of your 
file on each node.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 5:40 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your quick response. I'm submitting the Spark job to YARN in
"yarn-client" mode on a 6-node cluster. I ran the job with DEBUG logging
turned on. I see the exception below, but it occurred after the
saveAsTextFile call finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is causing the problem?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Re: saveAsTextFile is not writing to local fs

2016-02-01 Thread Siva
Hi Mohammed,

Thanks for your response. Data is available on the worker nodes, but I was
looking for a way to write directly to the local fs. Seems like that is not an option.

Thanks,
Sivakumar Bhavanari.

On Mon, Feb 1, 2016 at 5:45 PM, Mohammed Guller 
wrote:

> You should not be saving an RDD to local FS if Spark is running on a real
> cluster. Essentially, each Spark worker will save the partitions that it
> processes locally.
>
>
>
> Check the directories on the worker nodes and you should find pieces of
> your file on each node.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 5:40 PM
> *To:* Mohammed Guller
> *Cc:* spark users
> *Subject:* Re: saveAsTextFile is not writing to local fs
>
>
>
> Hi Mohammed,
>
>
>
> Thanks for your quick response. I'm submitting the Spark job to YARN in
> "yarn-client" mode on a 6-node cluster. I ran the job with DEBUG logging
> turned on. I see the exception below, but it occurred after the
> saveAsTextFile call finished.
>
>
>
> 16/01/29 20:26:57 DEBUG HttpParser:
>
> java.net.SocketException: Socket closed
>
> at java.net.SocketInputStream.read(SocketInputStream.java:190)
>
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>
> at
> org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
>
> at
> org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
>
> at
> org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
>
> at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
>
> at
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>
> at
> org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 16/01/29 20:26:57 DEBUG HttpParser:
>
> java.net.SocketException: Socket closed
>
> at java.net.SocketInputStream.read(SocketInputStream.java:190)
>
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>
> at
> org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
>
> at
> org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
>
> at
> org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
>
> at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
>
> at
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>
> at
> org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
>
> org.spark-project.jetty.io.EofException
>
>
>
> Do you think this is causing the problem?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>
>
>
> On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller 
> wrote:
>
> Is it a multi-node cluster, or are you running Spark on a single machine?
>
>
>
> You can change Spark’s logging level to INFO or DEBUG to see what is going
> on.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 3:38 PM
> *To:* spark users
> *Subject:* saveAsTextFile is not writing to local fs
>
>
>
> Hi Everyone,
>
>
>
> We are using spark 1.4.1 and we have a requirement to write data to the
> local fs instead of hdfs.
>
>
>
> When trying to save an rdd to the local fs with saveAsTextFile, it just
> writes a _SUCCESS file in the folder, with no part- files and also no error
> or warning messages on the console.
>
>
>
> Is there any place to look at to fix this problem?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>
>
>


RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
You should not be saving an RDD to local FS if Spark is running on a real 
cluster. Essentially, each Spark worker will save the partitions that it 
processes locally.

Check the directories on the worker nodes and you should find pieces of your 
file on each node.
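
Concretely (a sketch; paths and names are illustrative):

// With a local-fs path, each worker writes only its own partitions locally.
rdd.saveAsTextFile("file:///tmp/out")
// Worker A ends up with /tmp/out/part-00000, worker B with /tmp/out/part-00001,
// and the driver's /tmp/out may contain only the _SUCCESS marker.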

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 5:40 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your quick response. I'm submitting the Spark job to YARN in
"yarn-client" mode on a 6-node cluster. I ran the job with DEBUG logging
turned on. I see the exception below, but it occurred after the
saveAsTextFile call finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is causing the problem?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Hi Everyone,

We are using spark 1.4.1 and we have a requirement to write data to the local
fs instead of hdfs.

When trying to save an rdd to the local fs with saveAsTextFile, it just writes
a _SUCCESS file in the folder, with no part- files and also no error or warning
messages on the console.

Is there any place to look at to fix this problem?

Thanks,
Sivakumar Bhavanari.



Re: saveAsTextFile is not writing to local fs

2016-01-29 Thread Siva
Hi Mohammed,

Thanks for your quick response. I'm submitting the Spark job to YARN in
"yarn-client" mode on a 6-node cluster. I ran the job with DEBUG logging
turned on. I see the exception below, but it occurred after the
saveAsTextFile call finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at
org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at
org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is causing the problem?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller 
wrote:

> Is it a multi-node cluster, or are you running Spark on a single machine?
>
>
>
> You can change Spark’s logging level to INFO or DEBUG to see what is going
> on.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 3:38 PM
> *To:* spark users
> *Subject:* saveAsTextFile is not writing to local fs
>
>
>
> Hi Everyone,
>
>
>
> We are using spark 1.4.1 and we have a requirement to write data to the
> local fs instead of hdfs.
>
>
>
> When trying to save an rdd to the local fs with saveAsTextFile, it just
> writes a _SUCCESS file in the folder, with no part- files and also no error
> or warning messages on the console.
>
>
>
> Is there any place to look at to fix this problem?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>


RE: saveAsTextFile is not writing to local fs

2016-01-29 Thread Mohammed Guller
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.
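
For example (a sketch; sc.setLogLevel is available from Spark 1.4 onwards):

sc.setLogLevel("DEBUG")  // set at runtime from the driver program or shell

# or persistently, in conf/log4j.properties on the submitting machine:
log4j.rootCategory=DEBUG, console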

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Hi Everyone,

We are using spark 1.4.1 and we have a requirement to write data to the local
fs instead of hdfs.

When trying to save an rdd to the local fs with saveAsTextFile, it just writes
a _SUCCESS file in the folder, with no part- files and also no error or warning
messages on the console.

Is there any place to look at to fix this problem?

Thanks,
Sivakumar Bhavanari.


Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander
Hi Jacinto,

If I were you, the first thing I would do is write a sample Java application
that writes data into HDFS and see if it works. Metadata is being created in
HDFS, which means communication with the namenode is working fine, but
communication with the datanodes is not, since you don't see any data inside
the file. Why don't you look at the HDFS logs and see what's happening when
your application talks to the namenode? I suspect a networking issue; also
check whether the datanodes are running fine.
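
A minimal sketch of such a write test (here in Scala rather than Java; the
namenode URI and path are placeholders, not values from this thread):

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteTest {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020") // adjust to your cluster
    val fs = FileSystem.get(conf)
    val path = new Path("/tmp/hdfs-write-test.txt")
    val out = new PrintWriter(fs.create(path))            // plain HDFS write
    try out.println("hello hdfs") finally out.close()
    // If the datanodes are reachable, the file now has a non-zero length.
    println(s"wrote ${fs.getFileStatus(path).getLen} bytes")
  }
}

If this small program also produces an empty file, the problem is in HDFS or
the network rather than in Spark.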

Thank you,
Ajay

On Saturday, October 3, 2015, Jacinto Arias  wrote:

> Yes, printing the result with collect or take works.
>
> This is actually a minimal example, but the same happens when working with
> real data: the actions are performed, and the resulting RDDs can be printed
> out without problem. The data is there and the operations are correct; they
> just cannot be written to a file.
>
>
> On 03 Oct 2015, at 16:17, Ted Yu wrote:
>
> bq.  val dist = sc.parallelize(l)
>
> Following the above, can you call, e.g. count() on dist before saving ?
>
> Cheers
>
> On Fri, Oct 2, 2015 at 1:21 AM, jarias wrote:
>
>> Dear list,
>>
>> I'm experiencing a problem when trying to write any RDD to HDFS. I've
>> tried
>> with minimal examples, scala programs and pyspark programs both in local
>> and
>> cluster modes and as standalone applications or shells.
>>
>> My problem is that when invoking the write command, a task is executed but
>> it just creates an empty folder in the given HDFS path. I'm lost at this
>> point because there is no sign of error or warning in the spark logs.
>>
>> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
>> working properly when using the command tools or running MapReduce jobs.
>>
>>
>> Thank you for your time, I'm not sure if this is just a rookie mistake or
>> an
>> overall config problem.
>>
>> Just a working example:
>>
>> This sequence produces the following log and creates the empty folder
>> "test":
>>
>> scala> val l = Seq.fill(10000)(nextInt)
>> scala> val dist = sc.parallelize(l)
>> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>>
>>
>> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer
>> Algorithm
>> version is 1
>> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
>> <console>:27
>> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
>> <console>:27) with 2 output partitions (allowLocal=false)
>> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile
>> at <console>:27)
>> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
>> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
>> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3
>> (MapPartitionsRDD[7]
>> at saveAsTextFile at <console>:27), which has no missing parents
>> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
>> curMem=184615, maxMem=278302556
>> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
>> memory (estimated size 134.1 KB, free 265.1 MB)
>> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
>> curMem=321951, maxMem=278302556
>> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as
>> bytes
>> in memory (estimated size 46.6 KB, free 265.1 MB)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
>> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
>> broadcast_3_piece0
>> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
>> DAGScheduler.scala:839
>> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from
>> Stage 3
>> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
>> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
>> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
>> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
>> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
>> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
>> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
>> 6) in 312 ms on nodo2.i3a.info (1/2)
>> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
>> 7) in 313 ms on nodo3.i3a.info (2/2)
>> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks
>> have
>> all completed, from pool
>> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
>> <console>:27) finished in 0.334 s
>> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
>> <console>:27, took 0.436388 s
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Jacinto Arias
Yes, printing the result with collect or take works.

This is actually a minimal example, but the same happens when working with
real data: the actions are performed, and the resulting RDDs can be printed
out without problem. The data is there and the operations are correct; they
just cannot be written to a file.


> On 03 Oct 2015, at 16:17, Ted Yu wrote:
> 
> bq.  val dist = sc.parallelize(l)
> 
> Following the above, can you call, e.g. count() on dist before saving ?
> 
> Cheers
> 
> On Fri, Oct 2, 2015 at 1:21 AM, jarias wrote:
> Dear list,
> 
> I'm experiencing a problem when trying to write any RDD to HDFS. I've tried
> with minimal examples, scala programs and pyspark programs both in local and
> cluster modes and as standalone applications or shells.
> 
> My problem is that when invoking the write command, a task is executed but
> it just creates an empty folder in the given HDFS path. I'm lost at this
> point because there is no sign of error or warning in the spark logs.
> 
> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
> working properly when using the command tools or running MapReduce jobs.
> 
> 
> Thank you for your time, I'm not sure if this is just a rookie mistake or an
> overall config problem.
> 
> Just a working example:
> 
> This sequence produces the following log and creates the empty folder
> "test":
> 
> scala> val l = Seq.fill(10000)(nextInt)
> scala> val dist = sc.parallelize(l)
> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>
>
> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
> <console>:27
> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
> <console>:27) with 2 output partitions (allowLocal=false)
> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
> <console>:27)
> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[7]
> at saveAsTextFile at <console>:27), which has no missing parents
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
> curMem=184615, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 134.1 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
> curMem=321951, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
> in memory (estimated size 46.6 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
> broadcast_3_piece0
> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:839
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3
> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 312 ms on nodo2.i3a.info (1/2)
> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 313 ms on nodo3.i3a.info (2/2)
> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
> all completed, from pool
> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
> <console>:27) finished in 0.334 s
> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
> <console>:27, took 0.436388 s
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq.  val dist = sc.parallelize(l)

Following the above, can you call, e.g. count() on dist before saving ?

Cheers

On Fri, Oct 2, 2015 at 1:21 AM, jarias  wrote:

> Dear list,
>
> I'm experiencing a problem when trying to write any RDD to HDFS. I've
> tried
> with minimal examples, scala programs and pyspark programs both in local
> and
> cluster modes and as standalone applications or shells.
>
> My problem is that when invoking the write command, a task is executed but
> it just creates an empty folder in the given HDFS path. I'm lost at this
> point because there is no sign of error or warning in the spark logs.
>
> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
> working properly when using the command tools or running MapReduce jobs.
>
>
> Thank you for your time, I'm not sure if this is just a rookie mistake or
> an
> overall config problem.
>
> Just a working example:
>
> This sequence produces the following log and creates the empty folder
> "test":
>
> scala> val l = Seq.fill(10000)(nextInt)
> scala> val dist = sc.parallelize(l)
> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>
>
> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
> <console>:27
> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
> <console>:27) with 2 output partitions (allowLocal=false)
> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at
> <console>:27)
> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3
> (MapPartitionsRDD[7]
> at saveAsTextFile at <console>:27), which has no missing parents
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
> curMem=184615, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 134.1 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
> curMem=321951, maxMem=278302556
> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as
> bytes
> in memory (estimated size 46.6 KB, free 265.1 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
> broadcast_3_piece0
> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:839
> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 3
> (MapPartitionsRDD[7] at saveAsTextFile at <console>:27)
> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 312 ms on nodo2.i3a.info (1/2)
> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 313 ms on nodo3.i3a.info (2/2)
> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
> all completed, from pool
> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
> <console>:27) finished in 0.334 s
> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 finished: saveAsTextFile at
> <console>:27, took 0.436388 s
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-creates-an-empty-folder-in-HDFS-tp24906.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: saveAsTextFile() part- files are missing

2015-05-21 Thread Tomasz Fruboes

Hi,

 it looks like you are writing to a local filesystem. Could you try writing
to a location visible to all nodes (master and workers), e.g. an NFS share?
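
For example (an illustrative path; any directory mounted at the same location
on every node would work):

counter.saveAsTextFile("file:///mnt/nfs_share/wordcount_output")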


 HTH,
  Tomasz

On 21.05.2015 at 17:16, rroxanaioana wrote:

Hello!
I just started with Spark. I have an application which counts words in a
file (1 MB file).
The file is stored locally. I loaded the file using native code and then
created the RDD from it.

 JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);
JavaRDD<String> words = rddFromFile.flatMap(...);
JavaPairRDD<String, Integer> pairs = words.mapToPair(...);
JavaPairRDD<String, Integer> counter = pairs.reduceByKey(...);

counter.saveAsTextFile("file:///root/output");
context.close();

I have one master and 2 slaves. I run the program from the master node.
The output directory is created on the master node and on the 2 slave nodes.
On the master node I have only one file, _SUCCESS (empty), and on the slave
nodes I have a _temporary file. I printed the counter at the console and the
result seems ok.
What am I doing wrong?
Thank you!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-part-files-are-missing-tp22974.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a Spark context and
creating a new one. Often, I was then unable to write to HDFS due to this
error. I subsequently switched to a different implementation: instead of
tearing down and re-initializing the Spark context, I submit a separate
request to YARN.
On Fri, May 15, 2015 at 2:35 PM Puneet Kapoor 
wrote:

> I am seeing this on hadoop 2.4.0 version.
>
> Thanks for your suggestions, I will try those and let you know if they
> help!
>
> On Sat, May 16, 2015 at 1:57 AM, Steve Loughran 
> wrote:
>
>>  What version of Hadoop are you seeing this on?
>>
>>
>>  On 15 May 2015, at 20:03, Puneet Kapoor 
>> wrote:
>>
>>  Hey,
>>
>>  Did you find any solution for this issue, we are seeing similar logs in
>> our Data node logs. Appreciate any help.
>>
>>
>>
>>
>>
>>  2015-05-15 10:51:43,615 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
>>  src: /192.168.112.190:46253 dst: /192.168.151.104:50010
>> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
>> remote=/192.168.112.190:46253]
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.BufferedInputStream.fill(Unknown Source)
>> at java.io.BufferedInputStream.read1(Unknown Source)
>> at java.io.BufferedInputStream.read(Unknown Source)
>> at java.io.DataInputStream.read(Unknown Source)
>> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>> at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
>> at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
>> at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>> at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>> at java.lang.Thread.run(Unknown Source)
>>
>>
>>  That's being logged @ error level in DN. It doesn't mean the DN has
>> crashed, only that it timed out waiting for data: something has gone wrong
>> elsewhere.
>>
>>  https://issues.apache.org/jira/browse/HDFS-693
>>
>>
>> there's a couple of properties you can set to extend the timeouts
>>
>> <property>
>>   <name>dfs.socket.timeout</name>
>>   <value>20000</value>
>> </property>
>>
>> <property>
>>   <name>dfs.datanode.socket.write.timeout</name>
>>   <value>20000</value>
>> </property>
>>
>>
>>
>> You can also increase the number of datanode transceiver threads to
>> handle data IO across the network
>>
>>
>> <property>
>>   <name>dfs.datanode.max.xcievers</name>
>>   <value>4096</value>
>> </property>
>>
>> Yes, that property has that explicit spelling; it's easy to get wrong
>>
>>
>


Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
I am seeing this on Hadoop 2.4.0.

Thanks for your suggestions, I will try those and let you know if they help!

On Sat, May 16, 2015 at 1:57 AM, Steve Loughran 
wrote:

>  What version of Hadoop are you seeing this on?
>
>
>  On 15 May 2015, at 20:03, Puneet Kapoor 
> wrote:
>
>  Hey,
>
>  Did you find any solution for this issue, we are seeing similar logs in
> our Data node logs. Appreciate any help.
>
>
>
>
>
>  2015-05-15 10:51:43,615 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
>  src: /192.168.112.190:46253 dst: /192.168.151.104:50010
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
> remote=/192.168.112.190:46253]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.BufferedInputStream.fill(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at java.io.DataInputStream.read(Unknown Source)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Unknown Source)
>
>
>  That's being logged @ error level in DN. It doesn't mean the DN has
> crashed, only that it timed out waiting for data: something has gone wrong
> elsewhere.
>
>  https://issues.apache.org/jira/browse/HDFS-693
>
>
> there's a couple of properties you can set to extend the timeouts
>
> <property>
>   <name>dfs.socket.timeout</name>
>   <value>20000</value>
> </property>
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>20000</value>
> </property>
>
>
>
> You can also increase the number of datanode transceiver threads to handle
> data IO across the network
>
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>4096</value>
> </property>
>
> Yes, that property has that explicit spelling; it's easy to get wrong
>
>


Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Steve Loughran
What version of Hadoop are you seeing this on?


On 15 May 2015, at 20:03, Puneet Kapoor <puneet.cse.i...@gmail.com> wrote:

Hey,

Did you find any solution for this issue, we are seeing similar logs in our 
Data node logs. Appreciate any help.





2015-05-15 10:51:43,615 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation  src:
/192.168.112.190:46253 dst: /192.168.151.104:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/192.168.151.104:50010 remote=/192.168.112.190:46253]
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.DataInputStream.read(Unknown Source)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Unknown Source)


That's being logged @ error level in DN. It doesn't mean the DN has crashed, 
only that it timed out waiting for data: something has gone wrong elsewhere.


https://issues.apache.org/jira/browse/HDFS-693


there's a couple of properties you can set to extend the timeouts

<property>
  <name>dfs.socket.timeout</name>
  <value>20000</value>
</property>

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
</property>


You can also increase the number of datanode transceiver threads to handle data
IO across the network



<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>


Yes, that property has that explicit spelling; it's easy to get wrong



Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
Hey,

Did you find any solution for this issue, we are seeing similar logs in our
Data node logs. Appreciate any help.


2015-05-15 10:51:43,615 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
 src: /192.168.112.190:46253 dst: /192.168.151.104:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.151.104:50010
remote=/192.168.112.190:46253]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.DataInputStream.read(Unknown Source)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:742)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Unknown Source)

Thanks
Puneet

On Wed, Dec 3, 2014 at 2:50 AM, Ganelin, Ilya 
wrote:

> Hi all, as the last stage of execution, I am writing out a dataset to disk. 
> Before I do this, I force the DAG to resolve so this is the only job left in 
> the pipeline. The dataset in question is not especially large (a few 
> gigabytes). During this step, however, HDFS will inevitably crash. I will lose
> connection to data-nodes and get stuck in the loop of death – failure causes 
> job restart, eventually causing the overall job to fail. On the data node 
> logs I see the errors below. Does anyone have any ideas as to what is going 
> on here? Thanks!
>
>
> java.io.IOException: Premature EOF from inputStream
>   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:741)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:718)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
>   at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> innovationdatanode03.cof.ds.capitalone.com:1004:DataXceiver error processing 
> WRITE_BLOCK operation  src: /10.37.248.60:44676 dst: /10.37.248.59:1004
> java.net.SocketTimeoutException: 65000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.37.248.59:43692 remote=/10.37.248.63:1004]
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>   at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>   at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>   at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2101)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:660)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Thanks much for your help.

Here's what was happening ...
The HDP VM was running in VirtualBox and the host was connected to the guest
VM in NAT mode. When I connected it in "Bridged Adapter" mode, it worked!

On Tue, May 5, 2015 at 8:54 PM, ayan guha  wrote:

> Try to add one more data node or set minreplication to 0. HDFS is trying
> to replicate at least one more copy and is not able to find another DN to
> do that.
> On 6 May 2015 09:37, "Sudarshan Murty"  wrote:
>
>> Another thing - could it be a permission problem?
>> It creates all the directory structure (in red) /tmp/wordcount/
>> _temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
>> so I am guessing not.
>>
>> On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty 
>> wrote:
>>
>>> You are most probably right. I assumed others may have run into this.
>>> When I try to put the files in there, it creates a directory structure
>>> with the part-00000 and part-00001 files, but these files are of size 0 - no
>>> content. The client error and the server logs have  the error message shown
>>> - which seem to indicate that the system is aware that a datanode exists
>>> but is excluded from the operation. So, it looks like it is not partitioned
>>> and Ambari indicates that HDFS is in good health with one NN, one SN, one
>>> DN.
>>> I am unable to figure out what the issue is.
>>> thanks for your help.
>>>
>>> On Tue, May 5, 2015 at 6:39 PM, ayan guha  wrote:
>>>
 What happens when you try to put files to your hdfs from local
 filesystem? Looks like its a hdfs issue rather than spark thing.
 On 6 May 2015 05:04, "Sudarshan"  wrote:

> I have searched all replies to this question & not found an answer.
>
> I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by 
> side, on the same machine and trying to write output of wordcount program 
> into HDFS (works fine writing to a local file, /tmp/wordcount).
>
> The only line I added to the wordcount program is (where 'counts' is the
> JavaPairRDD):
> *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*
>
> When I check in HDFS at that location (/tmp) here's what I find.
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000000_2/part-00000
> and
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
>
> and *both part-000[01] are 0 size files*.
>
> The wordcount client output error is:
> [Stage 1:>  (0 + 
> 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
>  *could only be replicated to 0 nodes instead of minReplication (=1).  
> There are 1 datanode(s) running and 1 node(s) are excluded in this 
> operation.*
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)
>
>
> I tried this with Spark 1.2.1; same error.
> I have plenty of space on the DFS.
> The Name Node, Sec Name Node & the one Data Node are all healthy.
>
> Any hint as to what may be the problem ?
> thanks in advance.
> Sudarshan
>
>
> --
> View this message in context: saveAsTextFile() to save output of
> Spark program to HDFS
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

>>>
>>


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
Try to add one more data node or set minreplication to 0. HDFS is trying
to replicate at least one more copy and is not able to find another DN to
do that.
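
For example, on a sandbox with a single datanode, setting the replication
factor to 1 in hdfs-site.xml (a sketch; restart HDFS after changing it)
avoids asking for replicas that cannot be placed:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>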
On 6 May 2015 09:37, "Sudarshan Murty"  wrote:

> Another thing - could it be a permission problem?
> It creates all the directory structure (in red) /tmp/wordcount/
> _temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
> so I am guessing not.
>
> On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty  wrote:
>
>> You are most probably right. I assumed others may have run into this.
>> When I try to put the files in there, it creates a directory structure
>> with the part-00000 and part-00001 files, but these files are of size 0 - no
>> content. The client error and the server logs have  the error message shown
>> - which seem to indicate that the system is aware that a datanode exists
>> but is excluded from the operation. So, it looks like it is not partitioned
>> and Ambari indicates that HDFS is in good health with one NN, one SN, one
>> DN.
>> I am unable to figure out what the issue is.
>> thanks for your help.
>>
>> On Tue, May 5, 2015 at 6:39 PM, ayan guha  wrote:
>>
>>> What happens when you try to put files to your hdfs from local
>>> filesystem? Looks like its a hdfs issue rather than spark thing.
>>> On 6 May 2015 05:04, "Sudarshan"  wrote:
>>>
 I have searched all replies to this question & not found an answer.

I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

The only line I added to the wordcount program is (where 'counts' is the
JavaPairRDD):
*counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*

 When I check in HDFS at that location (/tmp) here's what I find.
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000000_2/part-00000
and
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:>  (0 + 
 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
  *could only be replicated to 0 nodes instead of minReplication (=1).  
 There are 1 datanode(s) running and 1 node(s) are excluded in this 
 operation.*
at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


I tried this with Spark 1.2.1; same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node & the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan


 --
View this message in context: saveAsTextFile() to save output of Spark
program to HDFS
Sent from the Apache Spark User List mailing list archive at Nabble.com.

>>>
>>
>


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Another thing - could it be a permission problem?
It creates all the directory structure (in red) /tmp/wordcount/
_temporary/0/_temporary/attempt_201505051439_0001_m_000001_3/part-00001
so I am guessing not.

On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty  wrote:

> You are most probably right. I assumed others may have run into this.
> When I try to put the files in there, it creates a directory structure
> with the part-00000 and part-00001 files, but these files are of size 0 - no
> content. The client error and the server logs have  the error message shown
> - which seem to indicate that the system is aware that a datanode exists
> but is excluded from the operation. So, it looks like it is not partitioned
> and Ambari indicates that HDFS is in good health with one NN, one SN, one
> DN.
> I am unable to figure out what the issue is.
> thanks for your help.
>
> On Tue, May 5, 2015 at 6:39 PM, ayan guha  wrote:
>
>> What happens when you try to put files to your hdfs from local
>> filesystem? Looks like its a hdfs issue rather than spark thing.
>> On 6 May 2015 05:04, "Sudarshan"  wrote:
>>
>>> I have searched all replies to this question & not found an answer.
>>>
>>> I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
>>> side, on the same machine and trying to write output of wordcount program 
>>> into HDFS (works fine writing to a local file, /tmp/wordcount).
>>>
>>> Only line I added to the wordcount program is: (where 'counts' is the 
>>> JavaPairRDD)
>>> *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount 
>>> ");*
>>>
>>> When I check in HDFS at that location (/tmp) here's what I find.
>>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
>>> and
>>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>>>
>>> and *both part-000[01] are 0 size files*.
>>>
>>> The wordcount client output error is:
>>> [Stage 1:>  (0 + 2) 
>>> / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
>>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
>>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>>>  *could only be replicated to 0 nodes instead of minReplication (=1).  
>>> There are 1 datanode(s) running and 1 node(s) are excluded in this 
>>> operation.*
>>> at 
>>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
>>> at 
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)
>>>
>>>
>>> I tried this with Spark 1.2.1 same error.
>>> I have plenty of space on the DFS.
>>> The Name Node, Sec Name Node & the one Data Node are all healthy.
>>>
>>> Any hint as to what may be the problem ?
>>> thanks in advance.
>>> Sudarshan
>>>
>>>
>>> --
>>> View this message in context: saveAsTextFile() to save output of Spark
>>> program to HDFS
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
You are most probably right. I assumed others may have run into this.
When I try to put the files in there, it creates a directory structure with
the part-0 and part-1 files, but these files are of size 0 - no
content. The client error and the server logs have  the error message shown
- which seem to indicate that the system is aware that a datanode exists
but is excluded from the operation. So, it looks like it is not partitioned
and Ambari indicates that HDFS is in good health with one NN, one SN, one
DN.
I am unable to figure out what the issue is.
thanks for your help.

On Tue, May 5, 2015 at 6:39 PM, ayan guha  wrote:

> What happens when you try to put files to your hdfs from local filesystem?
> Looks like its a hdfs issue rather than spark thing.
> On 6 May 2015 05:04, "Sudarshan"  wrote:
>
>> I have searched all replies to this question & not found an answer.
>>
>> I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
>> side, on the same machine and trying to write output of wordcount program 
>> into HDFS (works fine writing to a local file, /tmp/wordcount).
>>
>> Only line I added to the wordcount program is: (where 'counts' is the 
>> JavaPairRDD)
>> *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount 
>> ");*
>>
>> When I check in HDFS at that location (/tmp) here's what I find.
>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
>> and
>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>>
>> and *both part-000[01] are 0 size files*.
>>
>> The wordcount client output error is:
>> [Stage 1:>  (0 + 2) 
>> / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
>> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>>  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
>> are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
>>  at 
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
>>  at 
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)
>>
>>
>> I tried this with Spark 1.2.1 same error.
>> I have plenty of space on the DFS.
>> The Name Node, Sec Name Node & the one Data Node are all healthy.
>>
>> Any hint as to what may be the problem ?
>> thanks in advance.
>> Sudarshan
>>
>>
>> --
>> View this message in context: saveAsTextFile() to save output of Spark
>> program to HDFS
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files to your hdfs from local filesystem?
Looks like its a hdfs issue rather than spark thing.
On 6 May 2015 05:04, "Sudarshan"  wrote:

>
> I have searched all replies to this question & not found an answer.
>
> I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
> side, on the same machine and trying to write output of wordcount program 
> into HDFS (works fine writing to a local file, /tmp/wordcount).
>
> Only line I added to the wordcount program is: (where 'counts' is the 
> JavaPairRDD)
> *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount 
> ");*
>
> When I check in HDFS at that location (/tmp) here's what I find.
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
> and
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>
> and *both part-000[01] are 0 size files*.
>
> The wordcount client output error is:
> [Stage 1:>  (0 + 2) / 
> 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
>  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)
>
>
> I tried this with Spark 1.2.1 same error.
> I have plenty of space on the DFS.
> The Name Node, Sec Name Node & the one Data Node are all healthy.
>
> Any hint as to what may be the problem ?
> thanks in advance.
> Sudarshan
>
>
> --
> View this message in context: saveAsTextFile() to save output of Spark
> program to HDFS
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Copy should be doable, but I'm not sure how to specify a prefix for the
directory while keeping the filename (i.e. part-0) fixed in the copy command.
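A minimal sketch of that copy step using the Hadoop FileSystem API (the paths
and the batchId prefix here are made-up names, and the rename-based approach
is just one option):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Move each part-* file of a finished batch into a common directory,
// prefixing the fixed part filename with a per-batch id to avoid collisions.
def mergeBatch(batchDir: String, commonDir: String, batchId: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  for (status <- fs.globStatus(new Path(batchDir, "part-*"))) {
    val target = new Path(commonDir, batchId + "-" + status.getPath.getName)
    fs.rename(status.getPath, target)
  }
}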



> On Apr 16, 2015, at 1:51 PM, Sean Owen  wrote:
> 
> Just copy the files? it shouldn't matter that much where they are as
> you can find them easily. Or consider somehow sending the batches of
> data straight into Redshift? no idea how that is done but I imagine
> it's doable.
> 
> On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
>  wrote:
>> Thanks Sean. I want to load each batch into Redshift. What's the best/most 
>> efficient way to do that?
>> 
>> Vadim
>> 
>> 
>>> On Apr 16, 2015, at 1:35 PM, Sean Owen  wrote:
>>> 
>>> You can't, since that's how it's designed to work. Batches are saved
>>> in different "files", which are really directories containing
>>> partitions, as is common in Hadoop. You can move them later, or just
>>> read them where they are.
>>> 
>>> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
>>>  wrote:
 I am using Spark Streaming where during each micro-batch I output data to 
 S3
 using
 saveAsTextFile. Right now each batch of data is put into its own directory
 containing
 2 objects, "_SUCCESS" and "part-0."
 
 How do I output each batch into a common directory?
 
 Thanks,
 Vadim
 ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Also, to work with the multithreading model of both Spark and HDFS, you can
even publish the data from Spark first to a message broker, e.g. Kafka, from
where a predetermined number (from 1 to infinity) of parallel consumers will
retrieve it and store it in HDFS in one or more finely controlled files and
directories

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:45 PM
To: Evo Eftimov
Cc: 
Subject: Re: saveAsTextFile

 

Thanks Evo for your detailed explanation.


On Apr 16, 2015, at 1:38 PM, Evo Eftimov  wrote:

The reason for this is as follows:

 

1.  You are saving data on HDFS

2.  HDFS as a cluster/server side Service has a Single Writer / Multiple 
Reader multithreading model 

3.  Hence each thread of execution in Spark has to write to a separate file 
in HDFS

4.  Moreover the RDDs are partitioned across cluster nodes and operated 
upon by multiple threads there and on top of that in Spark Streaming you have 
many micro-batch RDDs streaming in all the time as part of a DStream  

 

If you want fine / detailed management of the writing to HDFS you can implement 
your own HDFS adapter and invoke it in forEachRDD and foreach 

 

Regards

Evo Eftimov  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile

 

I am using Spark Streaming where during each micro-batch I output data to S3 
using

saveAsTextFile. Right now each batch of data is put into its own directory 
containing

2 objects, "_SUCCESS" and "part-0."

 

How do I output each batch into a common directory?

 

Thanks,

Vadim

  
 



Re: saveAsTextFile

2015-04-16 Thread Sean Owen
Just copy the files? it shouldn't matter that much where they are as
you can find them easily. Or consider somehow sending the batches of
data straight into Redshift? no idea how that is done but I imagine
it's doable.
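For what it's worth, one hedged sketch of the Redshift route: save each batch
to S3 with saveAsTextFile, then issue Redshift's COPY command over JDBC from
the driver. The JDBC URL, table name, and credentials below are placeholders,
not a tested setup:

import java.sql.DriverManager

// After rdd.saveAsTextFile("s3n://my-bucket/batch-42/"), bulk-load the
// resulting part files into Redshift with a single COPY statement.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://redshift-host:5439/mydb", "user", "password")
try {
  conn.createStatement().execute(
    "COPY events FROM 's3://my-bucket/batch-42/part-' " +
    "CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' " +
    "DELIMITER '\\t'")
} finally {
  conn.close()
}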

On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
 wrote:
> Thanks Sean. I want to load each batch into Redshift. What's the best/most 
> efficient way to do that?
>
> Vadim
>
>
>> On Apr 16, 2015, at 1:35 PM, Sean Owen  wrote:
>>
>> You can't, since that's how it's designed to work. Batches are saved
>> in different "files", which are really directories containing
>> partitions, as is common in Hadoop. You can move them later, or just
>> read them where they are.
>>
>> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
>>  wrote:
>>> I am using Spark Streaming where during each micro-batch I output data to S3
>>> using
>>> saveAsTextFile. Right now each batch of data is put into its own directory
>>> containing
>>> 2 objects, "_SUCCESS" and "part-0."
>>>
>>> How do I output each batch into a common directory?
>>>
>>> Thanks,
>>> Vadim
>>> ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Evo for your detailed explanation.

> On Apr 16, 2015, at 1:38 PM, Evo Eftimov  wrote:
> 
> The reason for this is as follows:
>  
> 1.   You are saving data on HDFS
> 2.   HDFS as a cluster/server side Service has a Single Writer / Multiple 
> Reader multithreading model
> 3.   Hence each thread of execution in Spark has to write to a separate 
> file in HDFS
> 4.   Moreover the RDDs are partitioned across cluster nodes and operated 
> upon by multiple threads there and on top of that in Spark Streaming you have 
> many micro-batch RDDs streaming in all the time as part of a DStream  
>  
> If you want fine / detailed management of the writing to HDFS you can 
> implement your own HDFS adapter and invoke it in forEachRDD and foreach
>  
> Regards
> Evo Eftimov  
>  
> From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
> Sent: Thursday, April 16, 2015 6:33 PM
> To: user@spark.apache.org
> Subject: saveAsTextFile
>  
> I am using Spark Streaming where during each micro-batch I output data to S3 
> using
> saveAsTextFile. Right now each batch of data is put into its own directory 
> containing
> 2 objects, "_SUCCESS" and "part-0."
>  
> How do I output each batch into a common directory?
>  
> Thanks,
> Vadim
> ᐧ


RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them
wherever you want - use foreachPartition and then foreach
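A minimal sketch of that, in Scala (the local temp-file sink is just a
stand-in for whatever store you target; the writer must be created inside
the partition because it is not serializable):

rdd.foreachPartition { iter =>
  // One writer per partition, created on the executor, not the driver.
  val file = java.io.File.createTempFile("batch-", ".txt")
  val writer = new java.io.PrintWriter(file)
  try {
    iter.foreach(elem => writer.println(elem))  // unbundle and store each element
  } finally {
    writer.close()
  }
}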

-Original Message-
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:39 PM
To: Sean Owen
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile

Thanks Sean. I want to load each batch into Redshift. What's the best/most 
efficient way to do that?

Vadim


> On Apr 16, 2015, at 1:35 PM, Sean Owen  wrote:
> 
> You can't, since that's how it's designed to work. Batches are saved 
> in different "files", which are really directories containing 
> partitions, as is common in Hadoop. You can move them later, or just 
> read them where they are.
> 
> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy 
>  wrote:
>> I am using Spark Streaming where during each micro-batch I output 
>> data to S3 using saveAsTextFile. Right now each batch of data is put 
>> into its own directory containing
>> 2 objects, "_SUCCESS" and "part-0."
>> 
>> How do I output each batch into a common directory?
>> 
>> Thanks,
>> Vadim
>> ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my earlier reply 

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, April 16, 2015 6:35 PM
To: Vadim Bichutskiy
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile

You can't, since that's how it's designed to work. Batches are saved in 
different "files", which are really directories containing partitions, as is 
common in Hadoop. You can move them later, or just read them where they are.

On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy  
wrote:
> I am using Spark Streaming where during each micro-batch I output data 
> to S3 using saveAsTextFile. Right now each batch of data is put into 
> its own directory containing
> 2 objects, "_SUCCESS" and "part-0."
>
> How do I output each batch into a common directory?
>
> Thanks,
> Vadim
> ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
The reason for this is as follows:

 

1.   You are saving data on HDFS

2.   HDFS as a cluster/server side Service has a Single Writer / Multiple 
Reader multithreading model 

3.   Hence each thread of execution in Spark has to write to a separate 
file in HDFS

4.   Moreover the RDDs are partitioned across cluster nodes and operated 
upon by multiple threads there and on top of that in Spark Streaming you have 
many micro-batch RDDs streaming in all the time as part of a DStream  

 

If you want fine / detailed management of the writing to HDFS you can implement 
your own HDFS adapter and invoke it in forEachRDD and foreach 

 

Regards

Evo Eftimov  

 

From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] 
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile

 

I am using Spark Streaming where during each micro-batch I output data to S3 
using

saveAsTextFile. Right now each batch of data is put into its own directory 
containing

2 objects, "_SUCCESS" and "part-0."

 

How do I output each batch into a common directory?

 

Thanks,

Vadim

  

 



Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Sean. I want to load each batch into Redshift. What's the best/most 
efficient way to do that?

Vadim


> On Apr 16, 2015, at 1:35 PM, Sean Owen  wrote:
> 
> You can't, since that's how it's designed to work. Batches are saved
> in different "files", which are really directories containing
> partitions, as is common in Hadoop. You can move them later, or just
> read them where they are.
> 
> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
>  wrote:
>> I am using Spark Streaming where during each micro-batch I output data to S3
>> using
>> saveAsTextFile. Right now each batch of data is put into its own directory
>> containing
>> 2 objects, "_SUCCESS" and "part-0."
>> 
>> How do I output each batch into a common directory?
>> 
>> Thanks,
>> Vadim
>> ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved
in different "files", which are really directories containing
partitions, as is common in Hadoop. You can move them later, or just
read them where they are.

On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
 wrote:
> I am using Spark Streaming where during each micro-batch I output data to S3
> using
> saveAsTextFile. Right now each batch of data is put into its own directory
> containing
> 2 objects, "_SUCCESS" and "part-0."
>
> How do I output each batch into a common directory?
>
> Thanks,
> Vadim
> ᐧ

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
is your data skewed?  Could it be that there are a few keys with a huge
number of records?  You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


you could do this with:

val input = sc.textFile("hdfs:///input")    // placeholder paths
val pairCounts = input.map(x => (x, 1)).reduceByKey(_ + _)
val sorted = pairCounts.sortByKey()
sorted.saveAsTextFile("hdfs:///output")


On Mon, Mar 9, 2015 at 12:31 PM, mingweili0x  wrote:

> I'm basically running a sorting using spark. The spark program will read
> from
> HDFS, sort on composite keys, and then save the partitioned result back to
> HDFS.
> pseudo code is like this:
>
> input = sc.textFile
> pairs = input.mapToPair
> sorted = pairs.sortByKey
> values = sorted.values
> values.saveAsTextFile
>
>  Input size is ~ 160G, and I made 1000 partitions specified in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
> split into two stages: saveAsTextFile and mapToPair. MapToPair finished
> in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress
> and the last few tasks just took forever and never finished.
>
> Cluster setup:
> 8 nodes
> on each node: 15gb memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Sean Owen
This is more of an aside, but why repartition this data instead of letting
it define partitions naturally? You will end up with a similar number.
On Mar 9, 2015 5:32 PM, "mingweili0x"  wrote:

> I'm basically running a sorting using spark. The spark program will read
> from
> HDFS, sort on composite keys, and then save the partitioned result back to
> HDFS.
> pseudo code is like this:
>
> input = sc.textFile
> pairs = input.mapToPair
> sorted = pairs.sortByKey
> values = sorted.values
> values.saveAsTextFile
>
>  Input size is ~ 160G, and I made 1000 partitions specified in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
> split into two stages: saveAsTextFile and mapToPair. MapToPair finished
> in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress
> and the last few tasks just took forever and never finished.
>
> Cluster setup:
> 8 nodes
> on each node: 15gb memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: saveAsTextFile extremely slow near finish

2015-03-09 Thread Akhil Das
Don't you think 1000 partitions is too few for 160GB of data? Also, you could
try using KryoSerializer and enabling RDD compression.
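For reference, a sketch of those two settings (these config keys exist in
Spark 1.x):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sort-160g")
  // Kryo is typically faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Compress serialized RDD partitions, trading CPU for memory and I/O.
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)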

Thanks
Best Regards

On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x  wrote:

> I'm basically running a sorting using spark. The spark program will read
> from
> HDFS, sort on composite keys, and then save the partitioned result back to
> HDFS.
> pseudo code is like this:
>
> input = sc.textFile
> pairs = input.mapToPair
> sorted = pairs.sortByKey
> values = sorted.values
> values.saveAsTextFile
>
>  Input size is ~ 160G, and I made 1000 partitions specified in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
> split into two stages: saveAsTextFile and mapToPair. MapToPair finished
> in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress
> and the last few tasks just took forever and never finished.
>
> Cluster setup:
> 8 nodes
> on each node: 15gb memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do

rdd.map(_.mkString("\t"))

or with some other delimiter to make it `RDD[String]`, and then call
`saveAsTextFile`.
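A self-contained example of that (the sample rows and output path are made
up):

val rdd = sc.parallelize(Seq(
  Array[Any]("alice", 1, 3.14),
  Array[Any]("bob", 2, 2.72)))

// mkString turns each Array[Any] into one tab-delimited line, giving an
// RDD[String] that saveAsTextFile can write.
rdd.map(_.mkString("\t")).saveAsTextFile("/tmp/rows")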




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SaveAsTextFile to S3 bucket

2015-01-27 Thread Thomas Demoor
S3 does not have the concept "directory". An S3 bucket only holds files
(objects). The Hadoop filesystem is mapped onto a bucket and uses
Hadoop-specific (or rather "s3tool"-specific: s3n uses the jets3t tool)
conventions (hacks) to fake directories, such as a key ending with a slash
("filename/") or, with s3n, "filename_$folder$" (these are leaky
abstractions, google that if you ever have some spare time :p). S3 simply
doesn't (and shouldn't) know about these conventions. Again, a bucket just
holds a shitload of files. This might seem inconvenient, but directories are
a really bad idea for scalable storage. However, setting "folder-like"
permissions can be done through IAM:
http://docs.aws.amazon.com/AmazonS3/latest/dev/example-policies-s3.html#iam-policy-ex1

Summarizing: by setting permissions on /dev you set permissions on that
object. It has no effect on the file "/dev/output" which is, as far as S3
cares, another object that happens to share part of the objectname with
/dev.

Thomas Demoor
skype: demoor.thomas
mobile: +32 497883833

On Tue, Jan 27, 2015 at 6:33 AM, Chen, Kevin  wrote:

>  When spark saves rdd to a text file, the directory must not exist
> upfront. It will create a directory and write the data to part- under
> that directory. In my use case, I create a directory dev in the bucket ://
> nexgen-software/dev. I expect it to create output directly under dev and a
> part- under output. But it gave me an exception as I only gave write
> permission to dev, not the bucket. If I open up write permission to the bucket,
> it worked. But it did not create output directory under dev, it rather
> creates another dev/output directory under bucket. I just want to know if
> it is possible to have output directory created under dev directory I
> created upfront.
>
>   From: Nick Pentreath 
> Date: Monday, January 26, 2015 9:15 PM
> To: "user@spark.apache.org" 
> Subject: Re: SaveAsTextFile to S3 bucket
>
>   Your output folder specifies
>
>  rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>
>  So it will try to write to /dev/output which is as expected. If you
> create the directory /dev/output upfront in your bucket, and try to save it
> to that (empty) directory, what is the behaviour?
>
> On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin 
> wrote:
>
>>  Does anyone know if I can save a RDD as a text file to a pre-created
>> directory in S3 bucket?
>>
>>  I have a directory created in S3 bucket: //nexgen-software/dev
>>
>>  When I tried to save a RDD as text file in this directory:
>> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>>
>>
>>  I got following exception at runtime:
>>
>> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
>> org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
>> ResponseCode=403, ResponseMessage=Forbidden
>>
>>
>>  I have verified /dev has write permission. However, if I grant the
>> bucket //nexgen-software write permission, I don't get exception. But the
>> output is not created under dev. Rather, a different /dev/output directory
>> is created in the bucket (//nexgen-software). Is this how
>> saveAsTextFile behaves in S3? Is there any way I can have output created
>> under a pre-defined directory.
>>
>>
>>  Thanks in advance.
>>
>>
>>
>>
>>
>


Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
When spark saves rdd to a text file, the directory must not exist upfront. It 
will create a directory and write the data to part- under that directory. 
In my use case, I create a directory dev in the bucket ://nexgen-software/dev .
I expect it to create output directly under dev and a part- under output. But
it gave me an exception as I only gave write permission to dev, not the bucket.
If I open up write permission to the bucket, it worked. But it did not create
the output directory under dev; it rather created another dev/output directory
under the bucket. I just want to know if it is possible to have the output
directory created under the dev directory I created upfront.

From: Nick Pentreath mailto:nick.pentre...@gmail.com>>
Date: Monday, January 26, 2015 9:15 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>
Subject: Re: SaveAsTextFile to S3 bucket

Your output folder specifies

rdd.saveAsTextFile("s3n://nexgen-software/dev/output");

So it will try to write to /dev/output which is as expected. If you create the 
directory /dev/output upfront in your bucket, and try to save it to that 
(empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin 
mailto:kevin.c...@neustar.biz>> wrote:
Does anyone know if I can save a RDD as a text file to a pre-created directory 
in S3 bucket?

I have a directory created in S3 bucket: //nexgen-software/dev

When I tried to save a RDD as text file in this directory:
rdd.saveAsTextFile("s3n://nexgen-software/dev/output");


I got following exception at runtime:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - 
ResponseCode=403, ResponseMessage=Forbidden


I have verified /dev has write permission. However, if I grant the bucket 
//nexgen-software write permission, I don't get the exception. But the output is
not created under dev. Rather, a different /dev/output directory is created
in the bucket (//nexgen-software). Is this how saveAsTextFile
behaves in S3? Is there any way I can have output created under a pre-defined
directory.


Thanks in advance.






Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the
argument for saveAsTextFile. This argument is treated as a folder in the
bucket, and the actual files are created in it with the naming convention
part-n, where n is the output partition number.

On Mon, Jan 26, 2015 at 9:15 PM, Nick Pentreath 
wrote:

> Your output folder specifies
>
> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>
> So it will try to write to /dev/output which is as expected. If you create
> the directory /dev/output upfront in your bucket, and try to save it to
> that (empty) directory, what is the behaviour?
>
> On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin 
> wrote:
>
>>  Does anyone know if I can save a RDD as a text file to a pre-created
>> directory in S3 bucket?
>>
>>  I have a directory created in S3 bucket: //nexgen-software/dev
>>
>>  When I tried to save a RDD as text file in this directory:
>> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>>
>>
>>  I got following exception at runtime:
>>
>> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
>> org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
>> ResponseCode=403, ResponseMessage=Forbidden
>>
>>
>>  I have verified /dev has write permission. However, if I grant the
>> bucket //nexgen-software write permission, I don't get exception. But the
>> output is not created under dev. Rather, a different /dev/output directory
>> is created in the bucket (//nexgen-software). Is this how
>> saveAsTextFile behaves in S3? Is there any way I can have output created
>> under a pre-defined directory.
>>
>>
>>  Thanks in advance.
>>
>>
>>
>>
>>
>


Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
Your output folder specifies

rdd.saveAsTextFile("s3n://nexgen-software/dev/output");

So it will try to write to /dev/output which is as expected. If you create
the directory /dev/output upfront in your bucket, and try to save it to
that (empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin  wrote:

>  Does anyone know if I can save a RDD as a text file to a pre-created
> directory in S3 bucket?
>
>  I have a directory created in S3 bucket: //nexgen-software/dev
>
>  When I tried to save a RDD as text file in this directory:
> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>
>
>  I got following exception at runtime:
>
> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
> org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
> ResponseCode=403, ResponseMessage=Forbidden
>
>
>  I have verified /dev has write permission. However, if I grant the
> bucket //nexgen-software write permission, I don't get exception. But the
> output is not created under dev. Rather, a different /dev/output directory
> is created in the bucket (//nexgen-software). Is this how
> saveAsTextFile behaves in S3? Is there any way I can have output created
> under a pre-defined directory.
>
>
>  Thanks in advance.
>
>
>
>
>


Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially,
saveAsTextFile calls toString() on the elements of the RDD, so a call to
null.toString will result in an NPE.
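If dropping the nulls is acceptable, a one-line guard (the output path is a
placeholder):

// Filter out null elements so toString is never called on null.
rdd.filter(_ != null).saveAsTextFile("/tmp/output")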



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p21178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2015-01-15 Thread Prannoy
Hi,

Before saving the RDD, do a collect on it and print its contents. Probably
it's a null value.
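That is, something like this, assuming the RDD is small enough that pulling
it all to the driver is safe:

// Inspect the contents before saving; any null elements will show up here.
rdd.collect().foreach(println)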

Thanks.

On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List] <
ml-node+s1001560n20953...@n3.nabble.com> wrote:

> If you can paste the code here I can certainly help.
>
> Also confirm the version of spark you are using
>
> Regards
> Pankaj
> Infoshore Software
> India
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p21160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override
the saveAsTextFile in SchemaRDD (or make Row.toString comma separated).

Do you mind filing a JIRA ticket and copy me?
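Until then, a workaround sketch on the Scala side (Row behaved like a Seq in
that era's API, so mkString should work; the query and paths are
hypothetical):

val results = sqlContext.sql("SELECT * FROM events")
// Format each Row explicitly instead of relying on its default toString.
results.map(_.mkString(",")).saveAsTextFile("/tmp/out")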


On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton  wrote:

> This is almost funny.
>
> I want to dump a computation to the filesystem.  It’s just the result of a
> Spark SQL call reading the data from Cassandra.
>
> The problem is that it looks like it’s just calling toString() which is
> useless.
>
> The example is below.
>
> I assume this is just a (bad) bug.
>
> org.apache.spark.sql.api.java.Row@37f108
>
> org.apache.spark.sql.api.java.Row@d0426773
>
> org.apache.spark.sql.api.java.Row@38c9d3
>
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
> 
>
>


Re: saveAsTextFile

2015-01-03 Thread Sanjay Subramanian

@laila Based on the error you mentioned in the Nabble link below, it seems like
there are no permissions to write to HDFS. So this is possibly why
saveAsTextFile is failing.

  From: Pankaj Narang 
 To: user@spark.apache.org 
 Sent: Saturday, January 3, 2015 4:07 AM
 Subject: Re: saveAsTextFile
   
If you can paste the code here I can certainly help.

Also confirm the version of spark you are using

Regards
Pankaj 
Infoshore Software 
India



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



   

Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here I can certainly help.

Also confirm the version of spark you are using

Regards
Pankaj 
Infoshore Software 
India



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, 

Have you tried running it while keeping the wordCounts.print()? Possibly the
import of the package *org.apache.spark.streaming._* is missing, so during
sbt package it is unable to locate the saveAsTextFile API.

Go to
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
to check if all the needed packages are there. 

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-error-tp18960p19006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
Hi Niko,

It looks like you are calling a method on DStream that does not exist.

Check out:
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams

for the method "saveAsTextFiles"

Harold
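That is (signature as of Spark 1.1; the prefix below reuses the original
poster's path, which is an assumption about intent):

// Writes one output directory per batch, named prefix-TIME_IN_MS.txt.
wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")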

On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin 
wrote:

> Hi,
>
> I tried to modify NetworkWordCount example in order to save the output to
> a file.
>
> In NetworkWordCount.scala I replaced the line
>
> wordCounts.print()
> with
> wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")
>
> When I ran sbt/sbt package it returned the following error:
>
> [error]
> /home/bart/spark-1.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCountModified.scala:57:
> value saveAsTextFile is not a member of
> org.apache.spark.streaming.dstream.DStream[(String, Int)]
> [error]
> wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")
> [error]^
> [error] one error found
> [error] (examples/compile:compile) Compilation failed
> [error] Total time: 5 s, completed Nov 14, 2014 9:38:53 PM
>
> Does anyone know why this error occurs?
> Is there any other way to save the results to a text file?
>
> Regards,
>
> Niko
>
>
>
>
>


Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update: I'm still getting the same issue. I ended up doing a coalesce
instead of a cache to get around the memory issue, but saveAsTextFile still
won't proceed without the coalesce or cache first.
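The workaround looks roughly like this (the partition count and path are
illustrative; why coalescing unblocks the save is not clear from this
thread):

// Merge into fewer partitions so the save creates fewer files/tasks,
// then write out the result.
rdd.coalesce(100).saveAsTextFile("hdfs:///output")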



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-makes-no-progress-without-caching-RDD-tp12613p13283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, 

Your job is probably hanging on the groupByKey step. Either GC is kicking
in and the process starts to hang, or the data is unbalanced and you end up with
stragglers (once GC kicks in, you'll start to get the connection errors you
shared). If you don't care about the list of values itself, but only its count
(that appears to be what you're trying to save, correct me if I'm wrong), then
I would suggest using `countByKey()` directly on
`JavaPairRDD<String, AnalyticsLogFlyweight> partitioned`.

Best, 
Burak 
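A sketch of that suggestion in Scala (countByKey returns a Map on the driver,
so this assumes the number of distinct tracking ids is modest; `pairs` stands
in for the job's partitioned pair RDD):

// Count records per trackingId without materializing the per-key value
// lists that groupByKey would build.
val counts: scala.collection.Map[String, Long] = pairs.countByKey()
counts.foreach { case (trackingId, n) => println(s"$trackingId -> $n") }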

- Original Message -

From: "David"  
To: "user"  
Sent: Tuesday, August 19, 2014 1:44:18 PM 
Subject: saveAsTextFile hangs with hdfs 

I have a simple spark job that seems to hang when saving to hdfs. When looking 
at the spark web ui, the job reached 97 of 100 tasks completed. I need some 
help determining why the job appears to hang. The job hangs on the 
"saveAsTextFile()" call. 


https://www.dropbox.com/s/fdp7ck91hhm9w68/Screenshot%202014-08-19%2010.53.24.png
 

The job is pretty simple: 

JavaRDD<String> analyticsLogs = context
    .textFile(Joiner.on(",").join(hdfs.glob("/spark-dfs", ".*\\.log$")), 4);

JavaRDD<AnalyticsLogFlyweight> flyweights = analyticsLogs
    .map(line -> {
        try {
            AnalyticsLog log = GSON.fromJson(line, AnalyticsLog.class);
            AnalyticsLogFlyweight flyweight = new AnalyticsLogFlyweight();
            flyweight.ipAddress = log.getIpAddress();
            flyweight.time = log.getTime();
            flyweight.trackingId = log.getTrackingId();
            return flyweight;
        } catch (Exception e) {
            LOG.error("error parsing json", e);
            return null;
        }
    });

JavaRDD<AnalyticsLogFlyweight> filtered = flyweights
    .filter(log -> log != null);

JavaPairRDD<String, AnalyticsLogFlyweight> partitioned = filtered
    .mapToPair((AnalyticsLogFlyweight log) -> new Tuple2<>(log.trackingId, log))
    .partitionBy(new HashPartitioner(100)).cache();

Ordering<AnalyticsLogFlyweight> ordering =
    Ordering.natural().nullsFirst().onResultOf(new Function<AnalyticsLogFlyweight, Long>() {
        public Long apply(AnalyticsLogFlyweight log) {
            return log.time;
        }
    });

JavaPairRDD<String, Iterable<AnalyticsLogFlyweight>> stringIterableJavaPairRDD
    = partitioned.groupByKey();
JavaPairRDD<String, Integer> stringIntegerJavaPairRDD =
    stringIterableJavaPairRDD.mapToPair((log) -> {
        List<AnalyticsLogFlyweight> sorted = Lists.newArrayList(log._2());
        sorted.forEach(l -> LOG.info("sorted {}", l));
        return new Tuple2<>(log._1(), sorted.size());
    });

String outputPath = "/summarized/groupedByTrackingId4";
hdfs.rm(outputPath, true);
stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath));


Thanks in advance, David 



Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
Not sure if this is helpful or not, but in one executor "stderr" log, I found
this:

14/08/19 20:17:04 INFO CacheManager: Partition rdd_5_14 not found, computing
it
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Getting 16251 non-empty blocks out of 25435 blocks
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Started 3 remote fetches in 123 ms
14/08/19 20:34:00 INFO SendingConnection: Initiating connection to
[localhost/127.0.0.1:39840]
14/08/19 20:34:00 WARN SendingConnection: Error finishing connection to
localhost/127.0.0.1:39840
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/08/19 20:34:00 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(localhost,39840)
14/08/19 20:34:00 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(localhost,39840)
14/08/19 20:34:08 INFO SendingConnection: Initiating connection to
[localhost/127.0.0.1:39840]
14/08/19 20:34:08 WARN SendingConnection: Error finishing connection to
localhost/127.0.0.1:39840
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/08/19 20:34:08 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(localhost,39840)
14/08/19 20:34:08 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(localhost,39840)





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-hangs-with-hdfs-tp12412p12420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
update: it hangs even when not writing to hdfs.  I changed the code to avoid
saveAsTextFile() and instead do a foreachPartition and log the results.
This time it gets to 96/100 tasks, but still hangs.



I changed the saveAsTextFile to:

stringIntegerJavaPairRDD.foreachPartition(p -> {
    while (p.hasNext()) {
        LOG.info("{}", p.next());
    }
});

Thanks, David.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-hangs-with-hdfs-tp12412p12419.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFile

2014-08-10 Thread durin
This should work:
jobs.saveAsTextFile("file:////home/hysom/testing")

Note the 4 slashes, it's really 3 slashes + absolute path.

This should be mentioned in the docs though; I only remember it from
having seen it somewhere else.
The output folder, here "testing", will be created and must therefore not
exist beforehand.


Best regards, Simon



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp11803p11846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org