RE: How can I increase the speed balancing?
AFAIK, this setting is meant to throttle the bandwidth used by the balancer so that balancing traffic does not severely impact the performance of running jobs. Increasing this value only has an effect when there is enough spare bandwidth on the network; on an already overloaded network, changing it may not show much improvement. I suggest you look at the total network capacity and the network usage on your datanodes to assess whether there is room to increase your balancer bandwidth. Can you also try running the balancer when there is no other traffic and see whether changing this value has any impact?

Correction: the default balancer bandwidth is 1MB/s, not 1KB/s as I mentioned in my previous post. Sorry for the typo.

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: 03 September 2014 17:38
To: user@hadoop.apache.org
Subject: RE: How can I increase the speed balancing?

I have also found that neither dfsadmin -setBalancerBandwidth nor dfs.datanode.balance.bandwidthPerSec has any notable effect on the apparent balancer rate. This is on Hadoop 2.2.0.
john

From: cho ju il [mailto:tjst...@kgrid.co.kr]
Sent: Wednesday, September 03, 2014 12:55 AM
To: user@hadoop.apache.org
Subject: RE: How can I increase the speed balancing?

Bandwidth is enough, and I use the command bin/hdfs dfsadmin -setBalancerBandwidth 52428800. Yet balancing is slow. I think it is because the file transfer speed is slow and only 5 files are moved per server.

-Original Message-
From: Srikanth Upputuri srikanth.upput...@huawei.com
To: user@hadoop.apache.org; Cc:
Sent: 2014-09-03 (Wed) 14:10:24
Subject: RE: How can I increase the speed balancing?

I am not sure what you meant by 'Bandwidth is not a lack of data nodes', but have you configured the balancer bandwidth property 'dfs.datanode.balance.bandwidthPerSec'? If not, it defaults to 1KB/s. You can increase this to improve the balancer speed. You may also set it dynamically using the command 'dfsadmin -setBalancerBandwidth <newbandwidth>' before running the balancer.

From: cho ju il [mailto:tjst...@kgrid.co.kr]
Sent: 03 September 2014 06:31
To: user@hadoop.apache.org
Subject: How can I increase the speed balancing?

hadoop version 2.4.1
Balancing speed is slow. I think it is because the file transfer speed is slow. Bandwidth is not a lack of data nodes. How can I increase the speed balancing?
*** balancer log
2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.207:40010
2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.205:40010
2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.203:40010
2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.210:40010
2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.211:40010
2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.114:40010
2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.206:40010
2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.201:40010
2014-09-03 09:44:18,043 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.202:40010
2014-09-03 09:44:18,043 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.204:40010
2014-09-03 09:44:18,043 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 3 over-utilized: [Source[192.168.0.203:40010, utilization=99.99693907952093], Source[192.168.0.201:40010, utilization=99.99713240471648], Source[192.168.0.202:40010, utilization=99.99652052169367]]
2014-09-03 09:44:18,043 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 2 underutilized: [BalancerDatanode[192.168.0.211:40010, utilization=62.735024524531006], BalancerDatanode[192.168.0.114:40010, utilization=2.3174560700459224]]
2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 70.30 TB to make the cluster balanced.
2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB bytes from 192.168.0.203:40010 to 192.168.0.211:40010
2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB bytes from 192.168.0.201:40010 to 192.168.0.114:40010
2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 20 GB in this iteration
2014-09-03 09:44:23,643 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
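For reference, a minimal sketch of the two knobs discussed in this thread (the 50 MB/s value is only an example). The per-DataNode throttle can be set persistently in hdfs-site.xml or adjusted on a running cluster with dfsadmin, and it is specified in bytes per second:

  <!-- hdfs-site.xml on each DataNode (takes effect after a DataNode restart) -->
  <property>
    <name>dfs.datanode.balance.bandwidthPerSec</name>
    <value>52428800</value> <!-- 50 MB/s -->
  </property>

  # or dynamically, before starting the balancer
  hdfs dfsadmin -setBalancerBandwidth 52428800

Note that bandwidth is not the only limit: each DataNode also caps the number of concurrent block moves (5 by default), which matches the "5 files per server" observation above. Later releases expose this as dfs.datanode.balance.max.concurrent.moves; on 2.4.1 the limit may be hard-coded, so even with ample bandwidth an iteration moves only a few blocks per node at a time.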
Hadoop and Open Data (CKAN.org).
Dear all,

I'm very new to Hadoop as I'm still trying to grasp its value and purpose. I do hope my question on this mailing list is OK. I manage our open data platform at our municipality, using CKAN.org. It works very well for its purpose of showing data and adding APIs to data. However, I'm very interested in knowing more about Hadoop and whether it would fit into an (open) data platform, as we are getting more and more data to show and to work with internally at our municipality. However, I cannot figure out whether it's the right purpose to use Hadoop for, whether it is overkill, or... Could someone elaborate on such a topic? I've Googled around a lot and looked at various videos online, and Hadoop seems to have its place, also in an open data platform environment.

Best regards,
Henrik
RE: question about matching java API with libHDFS
You can refer to the header file "src/main/native/libhdfs/hdfs.h" to see the available APIs in detail.

Regards,
Yi Liu

From: Demai Ni [mailto:nid...@gmail.com]
Sent: Thursday, September 04, 2014 5:21 AM
To: user@hadoop.apache.org
Subject: question about matching java API with libHDFS

hi, folks,

I am currently using Java to access HDFS. For example, I am using the API DFSClient.getNamenode().getBlockLocations(...) to retrieve file block information. Now I need to move the same logic into C/C++, so I am looking at libHDFS and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also using hdfs_test.c for some reference. However, I couldn't easily figure out whether the above Java API is exposed through libHDFS. Probably not, since I couldn't find it. That leads to my next question: is there an easy way to plug into the libHDFS framework to add an additional API?

Thanks a lot for your suggestions

Demai
RE: HDFS balance
Yes. We do it all the time. The node you move this cron job to only needs to have the Hadoop environment set up and proper connectivity to the cluster it is writing to.

On Sep 3, 2014 10:51 AM, John Lilley john.lil...@redpoint.net wrote:

Can you run the load from an edge node that is not a DataNode?
john

John Lilley
Chief Architect, RedPoint Global Inc.
1515 Walnut Street | Suite 300 | Boulder, CO 80302
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

-Original Message-
From: Georgi Ivanov [mailto:iva...@vesseltracker.com]
Sent: Wednesday, September 03, 2014 1:56 AM
To: user@hadoop.apache.org
Subject: HDFS balance

Hi,
We have an 11-node cluster. Every hour a cron job is started on node1 to upload one file (~1GB) to Hadoop (a plain hadoop fs -put). This way node1 is getting full, because the first replica is always stored on the node where the command is executed. Every day I run a re-balance, but this does not seem to be enough.

The effect of this is:
host1: 4.7TB/5.3TB
host[2-10]: 4.1TB/5.3TB

So I am always running out of space on host1. What I could do is spread the job across all the nodes and execute it on a random host. I don't really like this solution as it involves some NFS mounts, security issues, etc. Is there any better solution?

Thanks in advance.
George
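To make the suggestion concrete, here is a hedged sketch of running the hourly load from an edge node; the paths and file names below are placeholders, and the node only needs the Hadoop client binaries plus a copy of the cluster configuration:

  export HADOOP_CONF_DIR=/etc/hadoop/conf          # cluster config copied to the edge node
  hadoop fs -put /data/exports/hourly_$(date +%Y%m%d%H).gz /ingest/

Because the writing client is not itself a DataNode, the first replica is no longer pinned to node1; the NameNode places it on an essentially random DataNode, so the load spreads without NFS mounts or per-node cron jobs.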
Re: Hadoop and Open Data (CKAN.org).
I would recommend using Hadoop only if you are ingesting a lot of data and you need reasonable performance at scale. I would recommend starting with <insert language/tool of choice> to ingest and transform data until that process starts taking too long. For example, one of our researchers at the University of Michigan had to process ~150GB of data. Using Python, processing that data took about 45 minutes - it was not worth it to spend extra development time to run it on Hadoop. This time will change depending on what you need to do and the hardware available, naturally. So until you need to frequently process large amounts of data, I'd stick with something you're already familiar with.

Alec Ten Harmsel

On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:

Dear all, I'm very new to Hadoop as I'm still trying to grasp its value and purpose. I do hope my question on this mailing list is OK. I manage our open data platform at our municipality, using CKAN.org. It works very well for its purpose of showing data and adding APIs to data. However, I'm very interested in knowing more about Hadoop and whether it would fit into an (open) data platform, as we are getting more and more data to show and to work with internally at our municipality. However, I cannot figure out whether it's the right purpose to use Hadoop for, whether it is "overkill" or... Could someone elaborate on such a topic? I've Googled around a lot and looked at various videos online, and Hadoop seems to have its place, also in an open data platform environment. Best regards, Henrik
Re: Hadoop and Open Data (CKAN.org).
I understand that coding MR jobs in a programming language is sometimes required, but if we are just processing large amounts of data (machine learning, for example) we could use Pig. I recently processed 0.25 TB on AWS clusters in a reasonably short time. In that case the development effort was much smaller.

Thanks,
Mohan

On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel a...@alectenharmsel.com wrote:

I would recommend using Hadoop only if you are ingesting a lot of data and you need reasonable performance at scale. I would recommend starting with <insert language/tool of choice> to ingest and transform data until that process starts taking too long. For example, one of our researchers at the University of Michigan had to process ~150GB of data. Using Python, processing that data took about 45 minutes - it was not worth it to spend extra development time to run it on Hadoop. This time will change depending on what you need to do and the hardware available, naturally. So until you need to frequently process large amounts of data, I'd stick with something you're already familiar with.

Alec Ten Harmsel

On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:

Dear all, I'm very new to Hadoop as I'm still trying to grasp its value and purpose. I do hope my question on this mailing list is OK. I manage our open data platform at our municipality, using CKAN.org. It works very well for its purpose of showing data and adding APIs to data. However, I'm very interested in knowing more about Hadoop and whether it would fit into an (open) data platform, as we are getting more and more data to show and to work with internally at our municipality. However, I cannot figure out whether it's the right purpose to use Hadoop for, whether it is "overkill" or... Could someone elaborate on such a topic? I've Googled around a lot and looked at various videos online, and Hadoop seems to have its place, also in an open data platform environment. Best regards, Henrik
Re: Datanode can not start with error Error creating plugin: org.apache.hadoop.metrics2.sink.FileSink
The reason you can't launch your datanode is:

2014-09-04 10:20:01,677 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Port in use: 0.0.0.0:50075

It appears that you already have a datanode instance listening on port 50075, or you have some other process listening on that port. The error you mentioned in the subject of your email is a warning message and is caused by a file system permission issue:

Caused by: java.io.FileNotFoundException: datanode-metrics.out (Permission denied)

On Wed, Sep 3, 2014 at 9:09 PM, ch huang justlo...@gmail.com wrote:

hi, mailing list:
I have a 10-worker-node Hadoop cluster using CDH 4.4.0. On one of my datanodes, one of the disks is full. When I restart this datanode, I get this error:

STARTUP_MSG: java = 1.7.0_45 /
2014-09-04 10:20:00,576 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: registered UNIX signal handlers for [TERM, HUP, INT]
2014-09-04 10:20:01,457 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-09-04 10:20:01,465 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error creating sink 'file'
org.apache.hadoop.metrics2.impl.MetricsConfigException: Error creating plugin: org.apache.hadoop.metrics2.sink.FileSink
        at org.apache.hadoop.metrics2.impl.MetricsConfig.getPlugin(MetricsConfig.java:203)
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.newSink(MetricsSystemImpl.java:478)
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configureSinks(MetricsSystemImpl.java:450)
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:429)
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:180)
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:156)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:54)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1792)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1728)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1751)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1904)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1925)
Caused by: org.apache.hadoop.metrics2.MetricsException: Error creating datanode-metrics.out
        at org.apache.hadoop.metrics2.sink.FileSink.init(FileSink.java:53)
        at org.apache.hadoop.metrics2.impl.MetricsConfig.getPlugin(MetricsConfig.java:199)
        ... 12 more
Caused by: java.io.FileNotFoundException: datanode-metrics.out (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.init(FileOutputStream.java:221)
        at java.io.FileWriter.init(FileWriter.java:107)
        at org.apache.hadoop.metrics2.sink.FileSink.init(FileSink.java:48)
        ... 13 more
2014-09-04 10:20:01,488 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
2014-09-04 10:20:01,546 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 5 second(s).
2014-09-04 10:20:01,546 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2014-09-04 10:20:01,547 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is ch15
2014-09-04 10:20:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:50010
2014-09-04 10:20:01,572 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 10485760 bytes/s
2014-09-04 10:20:01,607 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2014-09-04 10:20:01,657 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode
2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2014-09-04 10:20:01,664 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 0.0.0.0:50075
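A couple of hedged follow-up checks, assuming a Linux DataNode host; the port and file names come from the log above, and the log directory is only an example:

  # find out what is already listening on the DataNode HTTP port
  netstat -tlnp | grep 50075        # or: lsof -i :50075

  # hadoop-metrics2.properties: give the 'file' sink an absolute, writable path
  datanode.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
  datanode.sink.file.filename=/var/log/hadoop-hdfs/datanode-metrics.out

The first addresses the fatal BindException; the second only silences the FileSink warning, which by itself does not prevent the DataNode from starting.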
Re: question about matching java API with libHDFS
hi, Yi A,

Thanks for your response. I took a look at hdfs.h and hdfs.c, and it seems the lib only exposes some of the APIs; there are a lot of other public methods that can be accessed through the Java API/client but are not implemented in libhdfs, such as the one I am using now: DFSClient.getNamenode().getBlockLocations(...). Is libhdfs designed to limit access?

Thanks

Demai

On Thu, Sep 4, 2014 at 2:36 AM, Liu, Yi A yi.a@intel.com wrote:

You can refer to the header file "src/main/native/libhdfs/hdfs.h" to see the available APIs in detail.

Regards,
Yi Liu

*From:* Demai Ni [mailto:nid...@gmail.com]
*Sent:* Thursday, September 04, 2014 5:21 AM
*To:* user@hadoop.apache.org
*Subject:* question about matching java API with libHDFS

hi, folks,

I am currently using Java to access HDFS. For example, I am using the API DFSClient.getNamenode().getBlockLocations(...) to retrieve file block information. Now I need to move the same logic into C/C++, so I am looking at libHDFS and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also using hdfs_test.c for some reference. However, I couldn't easily figure out whether the above Java API is exposed through libHDFS. Probably not, since I couldn't find it. That leads to my next question: is there an easy way to plug into the libHDFS framework to add an additional API?

Thanks a lot for your suggestions

Demai
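For what it's worth, the information that DFSClient.getNamenode().getBlockLocations() returns is also reachable through the public FileSystem API, which is the layer libhdfs wraps; hdfs.h likewise declares hdfsGetHosts()/hdfsFreeHosts() for roughly the same purpose. A minimal Java sketch, where the path is a placeholder:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // reads core-site.xml/hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demai/somefile");   // placeholder path
    FileStatus status = fs.getFileStatus(file);
    // Block locations for the whole file: offset, length, and the hosts holding each block
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
    fs.close();
  }
}

This does not expose every detail of the internal protocol call, but for per-block offset/length/hosts it may save you from extending libhdfs yourself.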
[ANN] Multireducers - run multiple reducers on the same mapreduce job
I'd appreciate reviews of the code and the API of multireducers - a way to run multiple map and reduce classes in the same MapReduce job.

Thanks,
https://github.com/elazarl/multireducers

Usage example:

MultiJob.create().
    withMapper(SelectFirstField.class, Text.class, IntWritable.class).
    withReducer(CountFirstField.class, 1).
    withCombiner(CountFirstField.class).
    withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
    addTo(job);
MultiJob.create().
    withMapper(SelectSecondField.class, IntWritableInRange.class, IntWritable.class).
    withReducer(CountSecondField.class, 1).
    withCombiner(CountSecondField.class).
    withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
    addTo(job);

Motivation: Sometimes one would like to run more than one MapReduce job on the same input files. A classic example: selecting two different fields from a CSV file with two different mappers and counting the distinct values for each field. Let's say we have a CSV file with employee names and heights:

john,120
john,130
joe,180
moe,190
dough,130

We want one MapReduce job to count how many employees we have for each name (two johns in our case), and also how many employees we have for each height (two employees 130 cm high).

The code for the mappers looks like:

// i = 0 for the first reducer, 1 for the second
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    context.write(new Text(value.toString().split(",")[i]), one);
}

The code for the reducers looks like:

protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    context.write(key, new IntWritable(Iterables.size(values)));
}
Re: Need some tutorials for Mapreduce written in Python
Also, when you look at examples, pay attention to the Hadoop version. The Java API has changed a bit, which can be confusing.

On Aug 28, 2014, at 10:10 AM, Amar Singh amarsingh...@gmail.com wrote:

Thank you to everyone who responded to this thread. I got a couple of good leads and some good online courses to explore to get a fundamental understanding of things.
Thanks
Amar

On Thu, Aug 28, 2014 at 10:15 AM, Sriram Balachander sriram.balachan...@gmail.com wrote:

Hadoop: The Definitive Guide and Hadoop in Action are good books, and the course on Edureka is also good.
Regards
Sriram

On Wed, Aug 27, 2014 at 9:25 PM, thejas prasad thejch...@gmail.com wrote:

Are there any books for this as well?

On Wed, Aug 27, 2014 at 8:30 PM, Marco Shaw marco.s...@gmail.com wrote:

You might want to consider the Hadoop course on udacity.com. I think it provides a decent foundation in Hadoop/MapReduce with a focus on Python (using the streaming API like Sebastiano mentions).
Marco

On Wed, Aug 27, 2014 at 3:13 PM, Amar Singh amarsingh...@gmail.com wrote:

Hi Users,
I am new to the big data world and was in the process of reading some material on writing MapReduce using Python. Any links or pointers in that direction would be really helpful.
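Since the thread is about Python specifically: most of the tutorials mentioned build on Hadoop Streaming, which runs any executable that reads records on stdin and writes tab-separated key/value pairs to stdout. A hedged sketch of the submit command, where mapper.py and reducer.py are hypothetical scripts you would supply and the exact path of the streaming jar varies by version and distribution:

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /data/input \
    -output /data/output

The same pair of scripts can be tested locally with a plain pipe (cat input | ./mapper.py | sort | ./reducer.py) before submitting, which is usually the fastest way to debug streaming jobs.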