data locality in HDFS

2008-06-18 Thread Ian Holsman (Lists)

hi.

I want to run a distributed cluster, where I have say 20 machines/slaves 
in 3 separate data centers that belong to the same cluster.


Ideally I would like the other machines in the data center to be able to 
upload files (apache log files in this case) onto the local slaves and 
then have map/red tasks do their magic without having to move data until 
the reduce phase where the amount of data will be smaller.


does Hadoop have this functionality?
how do people handle multi-datacenter logging with hadoop in this case? 
do you just copy the data into a central location?


regards
Ian


Re: dfs put fails

2008-06-18 Thread Alexander Arimond

Thank you. First I tried the put from the master machine, which leads to
the error. The put from the slave machine works. Guess you're right about
the configuration parameters. It appears a bit strange to me, because the
firewall settings and the hadoop-site.xml on both machines are identical.

On Tue, 2008-06-17 at 14:08 -0700, Konstantin Shvachko wrote:
 Looks like the client machine from which you call -put cannot connect to the 
 data-nodes.
 It could be firewall or wrong configuration parameters that you use for the 
 client.
 
 Alexander Arimond wrote:
  hi, 
  
  i'm new to hadoop and i'm just testing it at the moment. 
  i set up a cluster with 2 nodes and it seems like they are running
  normally; 
  the log files of the namenode and the datanodes don't show errors. 
  The firewall should be set up right. 
  but when i try to upload a file to the dfs i get the following message: 
  
  [EMAIL PROTECTED]:~/hadoop$ bin/hadoop dfs -put file.txt file.txt 
  08/06/12 14:44:19 INFO dfs.DFSClient: Exception in
  createBlockOutputStream java.net.ConnectException: Connection refused 
  08/06/12 14:44:19 INFO dfs.DFSClient: Abandoning block
  blk_5837981856060447217 
  08/06/12 14:44:28 INFO dfs.DFSClient: Exception in
  createBlockOutputStream java.net.ConnectException: Connection refused 
  08/06/12 14:44:28 INFO dfs.DFSClient: Abandoning block
  blk_2573458924311304120 
  08/06/12 14:44:37 INFO dfs.DFSClient: Exception in
  createBlockOutputStream java.net.ConnectException: Connection refused 
  08/06/12 14:44:37 INFO dfs.DFSClient: Abandoning block
  blk_1207459436305221119 
  08/06/12 14:44:46 INFO dfs.DFSClient: Exception in
  createBlockOutputStream java.net.ConnectException: Connection refused 
  08/06/12 14:44:46 INFO dfs.DFSClient: Abandoning block
  blk_-8263828216969765661 
  08/06/12 14:44:52 WARN dfs.DFSClient: DataStreamer Exception:
  java.io.IOException: Unable to create new block. 
  08/06/12 14:44:52 WARN dfs.DFSClient: Error Recovery for block
  blk_-8263828216969765661 bad datanode[0] 
  
  
  don't know what that means and didn't find anything about it. 
  Hope somebody can help with that. 
  
  Thank you!
  
  
 



Re: is there a way to to debug hadoop from Eclipse

2008-06-18 Thread Brian Vargas
JMock also works rather well, using its cglib extensions, for mocking 
out fake FileSystem implementations, if you're expecting your code to 
make calls directly to the filesystem for some reason.
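
For example, a minimal sketch (JMock 2 with its cglib-based ClassImposteriser assumed; the path and the single expectation are made up) of handing code under test a mocked FileSystem instead of a real HDFS handle:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.jmock.Expectations;
import org.jmock.Mockery;
import org.jmock.lib.legacy.ClassImposteriser;

public class FileSystemMockSketch {
    public static void main(String[] args) throws Exception {
        // ClassImposteriser (cglib) lets JMock mock the abstract FileSystem class.
        Mockery context = new Mockery() {{
            setImposteriser(ClassImposteriser.INSTANCE);
        }};
        final FileSystem fs = context.mock(FileSystem.class);

        // Pretend the input path exists without touching any real filesystem.
        context.checking(new Expectations() {{
            one(fs).exists(new Path("/some/input"));
            will(returnValue(true));
        }});

        // Code under test would receive "fs" instead of a real HDFS handle.
        System.out.println(fs.exists(new Path("/some/input")));
        context.assertIsSatisfied();
    }
}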


Brian

Matt Kent wrote:

JMock is a unit testing tool for creating mock objects. I use it to mock
things like OutputCollector and Reporter, so I can unit test mappers and
reducers without running a cluster. In other words, I'm just testing the
logic of the code within the map() and reduce() methods, and testing the
map and reduce separately. I'm not feeding it real data from HDFS or
running the code in a real cluster.
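
For example, a minimal sketch of such a test (JMock 2 API and the 0.16-era mapred interfaces assumed; WordCountMapper is a hypothetical mapper under test, defined inline here just to keep the sketch self-contained):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.jmock.Expectations;
import org.jmock.Mockery;

public class WordCountMapperTest {

    // Hypothetical mapper under test: emits (word, 1) for each token.
    static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                output.collect(new Text(tok.nextToken()), new IntWritable(1));
            }
        }
    }

    @SuppressWarnings("unchecked")
    public void testMap() throws IOException {
        Mockery context = new Mockery();
        final OutputCollector<Text, IntWritable> output =
                (OutputCollector<Text, IntWritable>) context.mock(OutputCollector.class);
        final Reporter reporter = context.mock(Reporter.class);

        // Expect exactly one (word, 1) pair per token; no calls on the Reporter.
        context.checking(new Expectations() {{
            one(output).collect(new Text("hello"), new IntWritable(1));
            one(output).collect(new Text("world"), new IntWritable(1));
        }});

        new WordCountMapper().map(new LongWritable(0L), new Text("hello world"),
                                  output, reporter);
        context.assertIsSatisfied();
    }
}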

Matt

On Tue, 2008-06-17 at 18:50 -0700, Richard Zhang wrote:

I created three virtual machines, each of which works as a node.
Does JMock support debugging a multi-node cluster within Eclipse?
Could we set breakpoints and trace the running steps of a map/reduce
program?
Richard


On Mon, Jun 16, 2008 at 6:54 PM, Matt Kent [EMAIL PROTECTED] wrote:


The approach I've taken is to use JMock and create a unit test for the
mapreduce, then debug that within Eclipse on my workstation. For
performance debugging, I use YourKit on the cluster.

Matt

On Mon, 2008-06-16 at 16:58 -0700, Mori Bellamy wrote:

Hey Richard,

I'm interested in the same thing myself :D. I was researching it
earlier today, and the best I know how to do is to use Eclipse's remote
debugging functionality (although this won't completely work: each map/
reduce task spawns in its own JVM, making debugging really hard). But
if you want, you can debug up until the mappers/reducers spawn. To do
this, you need to pass certain debug flags into the JVM. So you'd need
to do export HADOOP_OPTS=myFLagsForRemoteDebug
and then you'd go to Eclipse - Run - Open Debug Dialog and set up remote
debugging with the correct port.

if you find out a way to debug the mappers/reducers on eclipse, let me
know :D


On Jun 16, 2008, at 3:10 PM, Richard Zhang wrote:


Hello Hadoopers:

Is there a way to debug the hadoop code from the Eclipse IDE? I am
using Eclipse to read the source and build the project now.
How do I start hadoop jobs from Eclipse? Say if we can put in the
server names, could we trace the running process through
Eclipse, such as setting breakpoints and checking variable values?
That would be very helpful for development.
If anyone knows how to do it, could you please give some info?
Thanks.

Richard













hadoop file system error

2008-06-18 Thread 晋光峰
Dears,

I use hadoop-0.16.4 to do some work and found an error whose cause I can't
figure out.

The scenario is like this: In the reduce step, instead of using
OutputCollector to write results, I use FSDataOutputStream to write results to
files on HDFS (because I want to split the results by some rules). After the
job finished, I found that *some* files (but not all) are empty on HDFS. But
I'm sure the files are not empty in the reduce step, since I added some logs
to read the generated files. It seems that some files' contents are lost
after the reduce step. Has anyone happened to face such errors, or is it a
hadoop bug?

Please help me find the reason if any of you know.

Thanks & Regards
Guangfeng

-- 
Guangfeng Jin

Software Engineer

iZENEsoft (Shanghai) Co., Ltd


Re: data locality in HDFS

2008-06-18 Thread Dhruba Borthakur
HDFS uses the network topology to distribute and replicate data. An
admin has to configure a script that describes the network topology to
HDFS. This is specified by using the parameter
topology.script.file.name in the Configuration file. This has been
tested when nodes are on different subnets in the same data center.

This code might not be generic enough (and has not yet been tested) to
support multiple data centers.

One can extend this topology by writing one's own implementation
and specifying the new class (and its jar) using the config parameter
topology.node.switch.mapping.impl. You will find more details at
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Hadoop+Rack+Awareness
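
For example, a minimal sketch of such a custom mapping (the interface and package name are from the 0.17-era API and may differ in other versions, which add further methods; the host-naming rules used here are hypothetical):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Wired in via topology.node.switch.mapping.impl in the Configuration.
public class DataCenterAwareMapping implements DNSToSwitchMapping {
    // Map each host to a two-level location: /<datacenter>/<rack>.
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String host : names) {
            if (host.startsWith("dc1-")) {          // hypothetical naming scheme
                racks.add("/dc1/rack0");
            } else if (host.startsWith("dc2-")) {
                racks.add("/dc2/rack0");
            } else {
                racks.add("/default-rack");
            }
        }
        return racks;
    }
}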

thanks,
dhruba


On Tue, Jun 17, 2008 at 10:18 PM, Ian Holsman (Lists) [EMAIL PROTECTED] wrote:
 hi.

 I want to run a distributed cluster, where I have say 20 machines/slaves in
 3 separate data centers that belong to the same cluster.

 Ideally I would like the other machines in the data center to be able to
 upload files (apache log files in this case) onto the local slaves and then
 have map/red tasks do their magic without having to move data until the
 reduce phase where the amount of data will be smaller.

 does Hadoop have this functionality?
 how do people handle multi-datacenter logging with hadoop in this case? do
 you just copy the data into a central location?

 regards
 Ian



how can i save the JobClient info?

2008-06-18 Thread Daniel
Hi all,
  I'm new to the Hadoop framework. I want to know: when one MapReduce job is
finished, is there any easy way to save the total number of input/output
records to some file or variables?
 Thanks.
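
One way (a minimal sketch only, not verified against 0.16.x: the counter group/name strings and the findCounter/getCounter method names are assumptions that vary between Hadoop versions) is to read the job's built-in counters after JobClient.runJob() returns and write them wherever you like:

import java.io.FileWriter;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterDump {
    public static void dump(JobConf conf) throws Exception {
        RunningJob job = JobClient.runJob(conf);   // blocks until the job finishes
        Counters counters = job.getCounters();

        // Group/counter names assumed; check the Counters output of your version.
        long mapIn = counters.findCounter(
                "org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS").getCounter();
        long reduceOut = counters.findCounter(
                "org.apache.hadoop.mapred.Task$Counter", "REDUCE_OUTPUT_RECORDS").getCounter();

        // Persist the totals to a local file for later use.
        FileWriter w = new FileWriter("job-counters.txt");
        w.write("map input records: " + mapIn + "\n");
        w.write("reduce output records: " + reduceOut + "\n");
        w.close();
    }
}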


Re: Internet-Based Secure Clustered FS?

2008-06-18 Thread Chris Collins
Have you considered Amazon S3? I don't know how strict your
security requirements are. There are lots of companies using it just for
offsite data storage, and also together with EC2.



C

On Jun 17, 2008, at 6:48 PM, Kenneth Miller wrote:


All,

  I'm looking for a solution that would allow me to securely use  
VPSs (hosted VMs) or hosted dedicated servers as nodes in a  
distributed file system. My bandwidth/speed requirements aren't  
high, space requirements are potentially huge and ever growing,  
superb security is a must, but I really don't want to worry about  
hosting the DFS in-house. Is there any solution that's capable of  
this and/or is there anyone currently doing this?


Regards,
Kenneth Miller




Re: dfs put fails

2008-06-18 Thread Alexander Arimond

Got a similar error when doing a mapreduce job on the master machine.
The map job is ok and in the end the right results are in my
output folder, but the reduce hangs at 17% for a very long time. Found this
in one of the task logs a few times:

...
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask:
task_200806181716_0001_r_00_0: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures 
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Got 0 known map output location(s); 
scheduling...
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Scheduled 0 of 0 known outputs (0 slow hosts 
and 0 dup hosts)
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 copy failed: 
task_200806181716_0001_m_01_0 from koeln
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: 
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.init(HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
at 
org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:139)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)

2008-06-18 17:31:03,276 INFO org.apache.hadoop.mapred.ReduceTask: Task 
task_200806181716_0001_r_00_0: Failed fetch #7 from 
task_200806181716_0001_m_01_0
2008-06-18 17:31:03,276 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
fetch map-output from task_200806181716_0001_m_01_0 even after 
MAX_FETCH_RETRIES_PER_MAP retries...  reporting to the JobTracker
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 adding host koeln to penalty box, next 
contact in 150 seconds
2008-06-18 17:31:03,277 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Need 1 map output(s)
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0: Got 0 new map-outputs & 0 obsolete 
map-outputs from tasktracker and 1 map-outputs from previous failures
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Got 1 known map output location(s); 
scheduling...
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Scheduled 0 of 1 known outputs (1 slow hosts 
and 0 dup hosts)
2008-06-18 17:31:08,336 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Need 1 map output(s)
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0: Got 0 new map-outputs & 0 obsolete 
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Got 1 known map output location(s); 
scheduling...
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Scheduled 0 of 1 known outputs (1 slow hosts 
and 0 dup hosts)
2008-06-18 17:31:13,356 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200806181716_0001_r_00_0 Need 1 map output(s)
...


Did I forget to open some ports? I opened 50010 for the datanode and the
ports for dfs and the jobtracker as specified in hadoop-site.xml.
If it's a firewall problem, wouldn't hadoop recognize that at startup, i.e.
wouldn't the connections be refused?



On Wed, 2008-06-18 at 11:32 +0200, Alexander Arimond wrote: 
 Thank you, first tried the put from the master machine, which leads to
 the error. The put from the slave machine works. Guess youre right with
 the configuration parameters. Appears a bit strange to me, because the
 firewall settings and the 

Re: hadoop file system error

2008-06-18 Thread Konstantin Shvachko

Did you close those files?
If not they may be empty.
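
For example, a minimal sketch of writing and closing a side file from a reducer with the old mapred API (the class name, output path, and record format here are all hypothetical); if close() is never reached, buffered data may never make it to HDFS and the file can look empty:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SplittingReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private FSDataOutputStream out;

    public void configure(JobConf conf) {
        try {
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical path; in practice derive it from the task id so
            // concurrent reduce tasks do not collide on the same file.
            out = fs.create(new Path("/user/guangfeng/side-output/part-x"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Write directly to the side file instead of the OutputCollector.
        out.writeBytes(key.toString() + "\t" + sum + "\n");
    }

    public void close() throws IOException {
        // Without this close() the stream's buffered data may be lost.
        out.close();
    }
}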


晋光峰 wrote:

Dears,

I use hadoop-0.16.4 to do some work and found a error which i can't get the
reasons.

The scenario is like this: In the reduce step, instead of using
OutputCollector to write result, i use FSDataOutputStream to write result to
files on HDFS(becouse i want to split the result by some rules). After the
job finished, i found that *some* files(but not all) are empty on HDFS. But
i'm sure in the reduce step the files are not empty since i added some logs
to read the generated file. It seems that some file's contents are lost
after the reduce step. Is anyone happen to face such errors? or it's a
hadoop bug?

Please help me to find the reason if you some guys know

Thanks  Regards
Guangfeng



Re: hadoop file system error

2008-06-18 Thread 晋光峰
I'm sure I close all the files in the reduce step. Are there any other
reasons that could cause this problem?

2008/6/18 Konstantin Shvachko [EMAIL PROTECTED]:

 Did you close those files?
 If not they may be empty.



 晋光峰 wrote:

 Dears,

 I use hadoop-0.16.4 to do some work and found a error which i can't get
 the
 reasons.

 The scenario is like this: In the reduce step, instead of using
 OutputCollector to write result, i use FSDataOutputStream to write result
 to
 files on HDFS(becouse i want to split the result by some rules). After the
 job finished, i found that *some* files(but not all) are empty on HDFS.
 But
 i'm sure in the reduce step the files are not empty since i added some
 logs
 to read the generated file. It seems that some file's contents are lost
 after the reduce step. Is anyone happen to face such errors? or it's a
 hadoop bug?

 Please help me to find the reason if you some guys know

 Thanks  Regards
 Guangfeng




-- 
Guangfeng Jin

Software Engineer

iZENEsoft (Shanghai) Co., Ltd
Room 601 Marine Tower, No. 1 Pudong Ave.
Tel:86-21-68860698
Fax:86-21-68860699
Mobile: 86-13621906422
Company Website:www.izenesoft.com