data locality in HDFS
hi. I want to run a distributed cluster where I have, say, 20 machines/slaves in 3 separate data centers that all belong to the same cluster. Ideally I would like the other machines in each data center to be able to upload files (Apache log files in this case) onto the local slaves, and then have map/reduce tasks do their magic without having to move data until the reduce phase, where the amount of data will be smaller. Does Hadoop have this functionality? How do people handle multi-datacenter logging with Hadoop in this case? Do you just copy the data into a central location? regards Ian
Re: dfs put fails
Thank you. I first tried the put from the master machine, which leads to the error; the put from the slave machine works. I guess you're right about the configuration parameters. It seems a bit strange to me, because the firewall settings and the hadoop-site.xml on both machines are identical.

On Tue, 2008-06-17 at 14:08 -0700, Konstantin Shvachko wrote:
Looks like the client machine from which you call -put cannot connect to the data-nodes. It could be a firewall or wrong configuration parameters that you use for the client.

Alexander Arimond wrote:
Hi, I'm new to Hadoop and I'm just testing it at the moment. I set up a cluster with 2 nodes and they seem to be running normally; the log files of the namenode and the datanodes don't show errors. The firewall should be set up correctly. But when I try to upload a file to the DFS, I get the following message:

[EMAIL PROTECTED]:~/hadoop$ bin/hadoop dfs -put file.txt file.txt
08/06/12 14:44:19 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:19 INFO dfs.DFSClient: Abandoning block blk_5837981856060447217
08/06/12 14:44:28 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:28 INFO dfs.DFSClient: Abandoning block blk_2573458924311304120
08/06/12 14:44:37 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:37 INFO dfs.DFSClient: Abandoning block blk_1207459436305221119
08/06/12 14:44:46 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
08/06/12 14:44:46 INFO dfs.DFSClient: Abandoning block blk_-8263828216969765661
08/06/12 14:44:52 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
08/06/12 14:44:52 WARN dfs.DFSClient: Error Recovery for block blk_-8263828216969765661 bad datanode[0]

I don't know what that means and didn't find anything about it. I hope somebody can help with this. Thank you!
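For anyone hitting the same symptom: a -put that fails only from one machine usually means that client either cannot reach the datanode data port (50010 by default) or is picking up a different configuration than the daemons. A minimal sketch to print what the client-side configuration actually resolves to; the property names are assumptions for a 0.16/0.17-era setup, and dfs.datanode.address in particular may differ on other versions:

    import org.apache.hadoop.conf.Configuration;

    public class PrintClientConf {
        public static void main(String[] args) {
            // new Configuration() loads hadoop-default.xml and hadoop-site.xml
            // from the classpath, so run this with the same conf directory the
            // failing client uses, then compare the output on both machines.
            Configuration conf = new Configuration();
            System.out.println("fs.default.name      = " + conf.get("fs.default.name"));
            System.out.println("dfs.datanode.address = " + conf.get("dfs.datanode.address"));
        }
    }

If the output matches on master and slave, the remaining suspect is connectivity from the master to the datanode port itself.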
Re: is there a way to debug hadoop from Eclipse
JMock also works rather well, using its cglib extensions, for mocking out fake FileSystem implementations, if you're expecting your code to make calls directly to the filesystem for some reason. Brian

Matt Kent wrote:
JMock is a unit testing tool for creating mock objects. I use it to mock things like OutputCollector and Reporter, so I can unit test mappers and reducers without running a cluster. In other words, I'm just testing the logic of the code within the map() and reduce() methods, and testing the map and reduce separately. I'm not feeding it real data from HDFS or running the code in a real cluster. Matt

On Tue, 2008-06-17 at 18:50 -0700, Richard Zhang wrote:
I created three virtual machines, each of which works as a node. Does JMock support debugging with a multi-node cluster within Eclipse? Could we set up breakpoints and trace the running steps of the map/reduce program? Richard

On Mon, Jun 16, 2008 at 6:54 PM, Matt Kent wrote:
The approach I've taken is to use JMock and create a unit test for the map/reduce, then debug that within Eclipse on my workstation. For performance debugging, I use YourKit on the cluster. Matt

On Mon, 2008-06-16 at 16:58 -0700, Mori Bellamy wrote:
Hey Richard, I'm interested in the same thing myself :D. I was researching it earlier today, and the best I know to do is to use Eclipse's remote debugging functionality (although this won't completely work: each map/reduce task spawns in its own JVM, making debugging really hard). But if you want, you can debug up until the mappers/reducers spawn. To do this, you need to pass certain debug flags into the JVM, so you'd do export HADOOP_OPTS=<your remote-debug JVM flags>, and then in Eclipse go to Run > Open Debug Dialog and set up remote debugging with the correct port. If you find out a way to debug the mappers/reducers in Eclipse, let me know :D

On Jun 16, 2008, at 3:10 PM, Richard Zhang wrote:
Hello Hadoopers: Is there a way to debug the Hadoop code from the Eclipse IDE? I am using Eclipse to read the source and build the project now. How do I start Hadoop jobs from Eclipse? Say we can set the server names; could we trace the running process through Eclipse, such as setting breakpoints and checking variable values? That would be very helpful for development. If anyone knows how to do it, could you please give some info? Thanks. Richard
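For reference, a minimal sketch of the approach Matt describes, assuming jMock 2 and the 0.16-era non-generic mapred interfaces; WordCountMapper is a hypothetical mapper that splits its input line on whitespace and emits (word, 1) pairs:

    import junit.framework.TestCase;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.jmock.Expectations;
    import org.jmock.Mockery;

    public class WordCountMapperTest extends TestCase {
        public void testMapEmitsOneCountPerWord() throws Exception {
            Mockery context = new Mockery();
            // OutputCollector and Reporter are interfaces, so plain jMock works;
            // mocking a class such as FileSystem is what needs the cglib
            // ClassImposteriser Brian mentions.
            final OutputCollector output = context.mock(OutputCollector.class);
            final Reporter reporter = context.mock(Reporter.class);

            context.checking(new Expectations() {{
                one(output).collect(new Text("hello"), new IntWritable(1));
                one(output).collect(new Text("world"), new IntWritable(1));
            }});

            new WordCountMapper().map(new LongWritable(0),
                    new Text("hello world"), output, reporter);
            context.assertIsSatisfied();
        }
    }

The test runs in the local JVM, so ordinary Eclipse breakpoints inside map() work without any remote-debug setup.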
hadoop file system error
Dears, I use hadoop-0.16.4 to do some work and found an error whose cause I can't determine. The scenario is like this: in the reduce step, instead of using OutputCollector to write results, I use FSDataOutputStream to write results to files on HDFS (because I want to split the results by some rules). After the job finished, I found that *some* of the files (but not all) are empty on HDFS. But I'm sure the files were not empty in the reduce step, since I added some logging that reads the generated files. It seems that some files' contents are lost after the reduce step. Has anyone happened to face such errors, or is it a Hadoop bug? Please help me find the reason if any of you know it. Thanks Regards Guangfeng -- Guangfeng Jin Software Engineer iZENEsoft (Shanghai) Co., Ltd
Re: data locality in HDFS
HDFS uses the network topology to distribute and replicate data. An admin has to configure a script that describes the network topology to HDFS; this is specified by setting the parameter topology.script.file.name in the configuration file. This has been tested when nodes are on different subnets in the same data center. The code might not be generic enough (and is not yet tested) to support multiple data centers. One can extend this topology by writing one's own implementation and specifying the new class via the config parameter topology.node.switch.mapping.impl. You will find more details at http://hadoop.apache.org/core/docs/current/cluster_setup.html#Hadoop+Rack+Awareness thanks, dhruba

On Tue, Jun 17, 2008 at 10:18 PM, Ian Holsman (Lists) wrote: [original question quoted above]
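A minimal sketch of such a custom implementation, assuming the DNSToSwitchMapping interface is the extension point behind topology.node.switch.mapping.impl and a purely hypothetical hostname scheme like dc1-rack2-node07:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class DataCenterAwareMapping implements DNSToSwitchMapping {
        // Maps each host name to a network path. The scheme is hypothetical:
        // it assumes hosts named like dc1-rack2-node07 and returns /dc1/rack2,
        // giving HDFS a data-center level above the rack level.
        public List<String> resolve(List<String> names) {
            List<String> paths = new ArrayList<String>(names.size());
            for (String name : names) {
                String[] parts = name.split("-");
                paths.add("/" + parts[0] + "/" + parts[1]);
            }
            return paths;
        }
    }

The class would then be named in topology.node.switch.mapping.impl in hadoop-site.xml; as noted above, placement across data centers is untested, so treat this as a sketch rather than a supported configuration.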
how can I save the JobClient info?
Hi all, I'm new to the Hadoop framework. I want to know: when a MapReduce job is finished, is there any easy way to save the total number of input/output records to some file or variable? Thanks.
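The record totals are tracked as job counters, which the client can read once the job completes. A minimal sketch against the old JobClient API; how usefully the Counters object prints varies by version (on some releases you may need its group/counter accessors instead of toString()), so treat the dump step as an assumption:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class RunAndSaveCounters {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(RunAndSaveCounters.class);
            // ... set mapper, reducer, input/output paths as usual ...

            RunningJob job = JobClient.runJob(conf); // blocks until the job finishes

            // The framework's record totals (map/reduce input and output
            // records) live in the job's counters; dump them to a local file.
            Counters counters = job.getCounters();
            PrintWriter out = new PrintWriter(new FileWriter("job-counters.txt"));
            out.println(counters);
            out.close();
        }
    }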
Re: Internet-Based Secure Clustered FS?
Have you considered Amazon S3? I don't know how strict your security requirements are, but there are lots of companies using it just for offsite data storage, and also together with EC2. C

On Jun 17, 2008, at 6:48 PM, Kenneth Miller wrote:
All, I'm looking for a solution that would allow me to securely use VPSs (hosted VMs) or hosted dedicated servers as nodes in a distributed file system. My bandwidth/speed requirements aren't high, my space requirements are potentially huge and ever-growing, and superb security is a must, but I really don't want to worry about hosting the DFS in-house. Is there any solution capable of this, and/or is anyone currently doing this? Regards, Kenneth Miller
Re: dfs put fails
Got a similar error when running a MapReduce job on the master machine. The map jobs are OK and in the end the right results are in my output folder, but the reduce hangs at 17% for a very long time. I found this in one of the task logs a few times:

...
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0: Got 0 new map-outputs 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Got 0 known map output location(s); scheduling...
2008-06-18 17:31:02,297 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 copy failed: task_200806181716_0001_m_01_0 from koeln
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.Socket.connect(Socket.java:519)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
    at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:139)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
2008-06-18 17:31:03,276 INFO org.apache.hadoop.mapred.ReduceTask: Task task_200806181716_0001_r_00_0: Failed fetch #7 from task_200806181716_0001_m_01_0
2008-06-18 17:31:03,276 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from task_200806181716_0001_m_01_0 even after MAX_FETCH_RETRIES_PER_MAP retries... reporting to the JobTracker
2008-06-18 17:31:03,276 WARN org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 adding host koeln to penalty box, next contact in 150 seconds
2008-06-18 17:31:03,277 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Need 1 map output(s)
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0: Got 0 new map-outputs 0 obsolete map-outputs from tasktracker and 1 map-outputs from previous failures
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Got 1 known map output location(s); scheduling...
2008-06-18 17:31:03,317 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Scheduled 0 of 1 known outputs (1 slow hosts and 0 dup hosts)
2008-06-18 17:31:08,336 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Need 1 map output(s)
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0: Got 0 new map-outputs 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Got 1 known map output location(s); scheduling...
2008-06-18 17:31:08,337 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Scheduled 0 of 1 known outputs (1 slow hosts and 0 dup hosts)
2008-06-18 17:31:13,356 INFO org.apache.hadoop.mapred.ReduceTask: task_200806181716_0001_r_00_0 Need 1 map output(s)
...

Did I forget to open some ports? I opened 50010 for the datanode and the ports for dfs and the jobtracker as specified in hadoop-site.xml. If it's a firewall problem, wouldn't Hadoop recognize that at startup, i.e. that connections would be refused?

On Wed, 2008-06-18 at 11:32 +0200, Alexander Arimond wrote: [previous message quoted above]
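The stack trace shows the reduce pulling map output over HTTP, and that fetch goes to the tasktracker's embedded web server, which listens on port 50060 by default; that's a separate port from the datanode's 50010, and nothing connects to it until the shuffle starts, which would explain why startup looks clean. A quick sketch to check reachability from the machine running the reduce; the hostnames are placeholders apart from "koeln", which appears in the log above:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            String[] hosts = { "master", "koeln" }; // replace with your slave hostnames
            int[] ports = { 50010, 50060 };         // datanode data port, tasktracker HTTP port
            for (String host : hosts) {
                for (int port : ports) {
                    // Try each host:port with a 2-second timeout and report the result.
                    Socket s = new Socket();
                    try {
                        s.connect(new InetSocketAddress(host, port), 2000);
                        System.out.println(host + ":" + port + " reachable");
                    } catch (Exception e) {
                        System.out.println(host + ":" + port + " NOT reachable (" + e.getMessage() + ")");
                    } finally {
                        s.close();
                    }
                }
            }
        }
    }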
Re: hadoop file system error
Did you close those files? If not, they may be empty.

Guangfeng Jin wrote: [original message quoted above]
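A minimal sketch of the pattern Konstantin is pointing at, against the 0.16-era non-generic interfaces; the path and types are hypothetical, and a single stream stands in for however many split files the real rules produce. If a stream is never closed, buffered data may never be flushed to HDFS, and the file can show up empty after the job finishes:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SplittingReducer extends MapReduceBase implements Reducer {
        private FSDataOutputStream out; // one stream shown; close every split file the same way

        public void configure(JobConf job) {
            try {
                FileSystem fs = FileSystem.get(job);
                // Hypothetical output path, made unique per task attempt.
                out = fs.create(new Path("/results/custom-" + job.get("mapred.task.id")));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void reduce(WritableComparable key, Iterator values,
                OutputCollector output, Reporter reporter) throws IOException {
            while (values.hasNext()) {
                out.writeBytes(key.toString() + "\t" + values.next().toString() + "\n");
            }
        }

        public void close() throws IOException {
            // Without this close(), buffered output may never reach HDFS.
            out.close();
        }
    }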
Re: hadoop file system error
I'm sure I close all the files in the reduce step. Could there be any other reason for this problem?

2008/6/18 Konstantin Shvachko wrote: Did you close those files? If not they may be empty. [earlier messages quoted above]

-- Guangfeng Jin Software Engineer iZENEsoft (Shanghai) Co., Ltd