RE: How can I increase the balancing speed?

2014-09-04 Thread Srikanth upputuri
AFAIK, this setting is meant to throttle the bandwidth used by the balancer so
that the balancing traffic does not severely impact the performance of running
jobs. Increasing this value will have an effect only when there is enough total
available bandwidth on the network; on an already overloaded network, changing
this value may not show much improvement. I suggest you look at the total
network capacity and the network usage on your datanodes to assess whether
there is sufficient room to increase your balancer bandwidth. Could you also
try running the balancer when there is no other traffic and see whether
changing this value has any impact?

Correction: Default balancer bandwidth is 1MB/s, not 1KB/s as I mentioned in my 
previous post. Sorry for the typo.

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: 03 September 2014 17:38
To: user@hadoop.apache.org
Subject: RE: How can I increase the balancing speed?

I have also found that neither
dfsadmin -setBalancerBandwidth
nor
dfs.datanode.balance.bandwidthPerSec
has any notable effect on the apparent balancer rate. This is on Hadoop 2.2.0.

john

From: cho ju il [mailto:tjst...@kgrid.co.kr]
Sent: Wednesday, September 03, 2014 12:55 AM
To: user@hadoop.apache.org
Subject: RE: How can I increase the balancing speed?


Bandwidth is sufficient.

And I use the command bin/hdfs dfsadmin -setBalancerBandwidth 52428800.

Yet balancing is slow.

I think it is because the file transfer speed is slow and only 5 files are moved per server.

-Original Message-
From: Srikanth Upputuri srikanth.upput...@huawei.com
To: user@hadoop.apache.org
Cc:
Sent: 2014-09-03 (Wed) 14:10:24
Subject: RE: How can I increase the balancing speed?

I am not sure what you meant by ‘Bandwidth is not a lack of data nodes’, but
have you configured the balancer bandwidth property
‘dfs.datanode.balance.bandwidthPerSec’? If not, it defaults to 1KB/s. You can
increase this to improve the balancer speed. You may also set it dynamically
using the command ‘dfsadmin -setBalancerBandwidth newbandwidth’ before running
the balancer.
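
For what it's worth, the same dynamic setting can also be applied from client
code through the HDFS API. A minimal, untested Java sketch (assuming
DistributedFileSystem.setBalancerBandwidth is available in your Hadoop version
and that the cluster configuration is on the classpath); the 52428800 value
mirrors the 50 MB/s example earlier in the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetBalancerBandwidth {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            // Ask all datanodes to use 50 MB/s for balancing traffic,
            // equivalent to: hdfs dfsadmin -setBalancerBandwidth 52428800
            ((DistributedFileSystem) fs).setBalancerBandwidth(52428800L);
        }
        fs.close();
    }
}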





From: cho ju il [mailto:tjst...@kgrid.co.kr]
Sent: 03 September 2014 06:31
To: user@hadoop.apache.org
Subject: How can I increase the balancing speed?



Hadoop version 2.4.1

Balancing speed is slow.

I think it is because the file transfer speed is slow.

Bandwidth is not a lack of data nodes.

How can I increase the balancing speed?









*** balancer log

2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.207:40010

2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.205:40010

2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.203:40010

2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.210:40010

2014-09-03 09:44:18,041 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.211:40010

2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.114:40010

2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.206:40010

2014-09-03 09:44:18,042 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.201:40010

2014-09-03 09:44:18,043 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.202:40010

2014-09-03 09:44:18,043 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/192.168.0.204:40010

2014-09-03 09:44:18,043 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 3 
over-utilized: [Source[192.168.0.203:40010, utilization=99.99693907952093], 
Source[192.168.0.201:40010, utilization=99.99713240471648], 
Source[192.168.0.202:40010, utilization=99.99652052169367]]

2014-09-03 09:44:18,043 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 2 
underutilized: [BalancerDatanode[192.168.0.211:40010, 
utilization=62.735024524531006], BalancerDatanode[192.168.0.114:40010, 
utilization=2.3174560700459224]]

2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Need to move 70.30 TB to make the cluster balanced.

2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Decided to move 10 GB bytes from 192.168.0.203:40010 to 192.168.0.211:40010

2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Decided to move 10 GB bytes from 192.168.0.201:40010 to 192.168.0.114:40010

2014-09-03 09:44:18,044 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Will move 20 GB in this iteration

2014-09-03 09:44:23,643 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Successfully moved 

Hadoop and Open Data (CKAN.org).

2014-09-04 Thread Henrik Aagaard Jørgensen
Dear all,

I'm very new to Hadoop, as I'm still trying to grasp its value and purpose. I
do hope my question is OK for this mailing list.

I manage our open data platform at our municipality, using CKAN.org. It works
very well for its purpose of showing data and adding APIs to data.

However, I'm very interested in knowing more about Hadoop and whether it would
fit into an (open) data platform, as we are getting more and more data to show
and to work with internally at our municipality.

However, I cannot figure out whether this is the right use case for Hadoop,
whether it is overkill, or...

Could someone elaborate on such a topic?

I've Googled around a lot and looked at various videos online, and Hadoop seems
to have its place, also in an open data platform environment.

Best regards,
Henrik


RE: question about matching java API with libHDFS

2014-09-04 Thread Liu, Yi A
You can refer to the header file “src/main/native/libhdfs/hdfs.h”; it lists
the available APIs in detail.

Regards,
Yi Liu

From: Demai Ni [mailto:nid...@gmail.com]
Sent: Thursday, September 04, 2014 5:21 AM
To: user@hadoop.apache.org
Subject: question about matching java API with libHDFS

hi, folks,
I am currently using Java to access HDFS. For example, I am using the API
DFSClient.getNamenode().getBlockLocations(...)... to retrieve file block
information.
Now I need to move the same logic into C/C++, so I am looking at libHDFS and
this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also using
hdfs_test.c for some reference. However, I couldn't find a way to easily figure
out whether the above Java API is exposed through libHDFS.
Probably not, since I couldn't find it. That leads to my next question: is
there an easy way to plug into the libHDFS framework to include additional APIs?

thanks a lot for your suggestions
Demai
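
On the Java side, note that the supported public route to block information is
FileSystem.getFileBlockLocations() rather than the internal DFSClient call, and
as far as I can tell that is roughly what libhdfs exposes in C as
hdfsGetHosts(). A minimal Java sketch, with an illustrative path:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demai/sample.txt"); // illustrative path
        FileStatus status = fs.getFileStatus(file);
        // Public-API counterpart of DFSClient.getNamenode().getBlockLocations(...)
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}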


RE: HDFS balance

2014-09-04 Thread Jamal B
Yes.  We do it all the time.

The node you move this cron job to only needs to have the Hadoop
environment set up and proper connectivity to the cluster it is
writing to.
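
To illustrate the point: the hourly upload is just an HDFS client call, so any
host with the cluster configuration and network access can run it, and when the
client is not itself a datanode the namenode places the first replica on a
datanode of its choosing instead of always on the local node. A minimal Java
sketch of the put (paths here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EdgeNodePut {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath, so this runs
        // the same way on an edge node as it does on a datanode.
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -put /data/hourly.csv /ingest/hourly.csv
        fs.copyFromLocalFile(new Path("/data/hourly.csv"), new Path("/ingest/hourly.csv"));
        fs.close();
    }
}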
On Sep 3, 2014 10:51 AM, John Lilley john.lil...@redpoint.net wrote:

 Can you run the load from an edge node that is not a DataNode?
 john

 John Lilley
 Chief Architect, RedPoint Global Inc.
 1515 Walnut Street | Suite 300 | Boulder, CO 80302
 T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
 Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net


 -Original Message-
 From: Georgi Ivanov [mailto:iva...@vesseltracker.com]
 Sent: Wednesday, September 03, 2014 1:56 AM
 To: user@hadoop.apache.org
 Subject: HDFS balance

 Hi,
 We have an 11-node cluster.
 Every hour a cron job is started to upload one file (~1 GB) to Hadoop on
 node1 (plain hadoop fs -put).

 This way node1 is getting full because the first replica is always stored
 on the node where the command is executed.
 Every day I run the balancer, but this does not seem to be enough.
 The effect of this is:
 host1: 4.7TB/5.3TB
 host[2-10]: 4.1TB/5.3TB

 So I am always running out of space on host1.

 What I could do is spread the job across all the nodes and execute it on a
 random host.
 I don't really like this solution as it involves NFS mounts, security
 issues, etc.

 Is there any better solution ?

 Thanks in advance.
 George




Re: Hadoop and Open Data (CKAN.org).

2014-09-04 Thread Alec Ten Harmsel
I would recommend using Hadoop only if you are ingesting a lot of data
and you need reasonable performance at scale. I would recommend starting
with <insert language/tool of choice> to ingest and transform data
until that process starts taking too long.

For example, one of our researchers at the University of Michigan had to
process ~150GB of data. Using Python, processing that data took about 45
minutes - it was not worth it to spend extra development time to run it
on Hadoop. This time will change depending on what you need to do and
the hardware available, naturally.

So until you need to frequently process large amounts of data, I'd stick
with something you're already familiar with.

Alec Ten Harmsel

On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:

 Dear all,

  

 I’m very new to Hadoop as I’m still trying to grasp its value and 
 purpose. I do hope my question on this mailing list is OK.

  

 I manage our open data platform at our municipality, using CKAN.org.
 It works very well for its purpose of showing data and adding APIs to
 data.

  

 However, I’m very interested in knowing more about Hadoop and if it
 would fit into a (open) data platform, as we are getting more and more
 data to show and to work with internally at our municipality.

  

 However, I cannot figure out whether this is the right use case for Hadoop,
 whether it is “overkill” or…

  

 Could someone elaborate on such a topic?

  

 I’ve Googled around a lot and looked at various videos online and
 Hadoop seems to have it place, also in an open data platform environment.

  

 Best regards,

 Henrik




Re: Hadoop and Open Data (CKAN.org).

2014-09-04 Thread Mohan Radhakrishnan
I understand that coding MR jobs in a language of choice is sometimes required,
but if we are just processing large amounts of data (machine learning, for
example) we could use Pig. I recently processed 0.25 TB on AWS clusters in a
reasonably short time. In this case the development effort is much lower.


Thanks,
Mohan


On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel a...@alectenharmsel.com
wrote:

  I would recommend using Hadoop only if you are ingesting a lot of data
 and you need reasonable performance at scale. I would recommend starting
 with <insert language/tool of choice> to ingest and transform data
 until that process starts taking too long.

 For example, one of our researchers at the University of Michigan had to
 process ~150GB of data. Using Python, processing that data took about 45
 minutes - it was not worth it to spend extra development time to run it on
 Hadoop. This time will change depending on what you need to do and the
 hardware available, naturally.

 So until you need to frequently process large amounts of data, I'd stick
 with something you're already familiar with.

 Alec Ten Harmsel

 On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:

  Dear all,



 I’m very new to Hadoop as I’m still trying to grasp its value and
 purpose. I do hope my question on this mailing list is OK.



 I manage our open data platform at our municipality, using CKAN.org. It
 works very well for its purpose of showing data and adding APIs to data.



 However, I’m very interested in knowing more about Hadoop and if it would
 fit into a (open) data platform, as we are getting more and more data to
 show and to work with internally at our municipality.



 However, I cannot figure out whether this is the right use case for Hadoop,
 whether it is “overkill” or…



 Could someone elaborate on such a topic?



 I’ve Googled around a lot and looked at various videos online and Hadoop
 seems to have it place, also in an open data platform environment.



 Best regards,

 Henrik





Re: Datanode can not start with error Error creating plugin: org.apache.hadoop.metrics2.sink.FileSink

2014-09-04 Thread Rich Haase
The reason you can't launch your datanode is:

2014-09-04 10:20:01,677 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Port in use: 0.0.0.0:50075

It appears that you already have a datanode instance listening on port
50075, or you have some other process listening on that port.

The error you mentioned in the subject of your email is a warning message
and is caused by a file system permission issue:

Caused by: java.io.FileNotFoundException: datanode-metrics.out (Permission
denied)
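
If you also want the FileSink warning to go away: as far as I know the file
sink's output location is configured in hadoop-metrics2.properties (the stock
template has a datanode.sink.file.filename entry), and a relative filename such
as datanode-metrics.out is resolved against the DataNode's working directory,
so pointing it at an absolute path the DataNode user can write to (for example
datanode.sink.file.filename=/var/log/hadoop-hdfs/datanode-metrics.out, path
hypothetical), or disabling the file sink, should clear it. Either way, it is
only a warning and is not what is stopping the DataNode.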



On Wed, Sep 3, 2014 at 9:09 PM, ch huang justlo...@gmail.com wrote:

 hi, mailing list:

 I have a 10-worker-node Hadoop cluster using CDH 4.4.0. On one of my
 datanodes, one of its disks is full.

 When I restart this datanode, I get this error:


 STARTUP_MSG:   java = 1.7.0_45
 /
 2014-09-04 10:20:00,576 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: registered UNIX signal
 handlers for [TERM, HUP, INT]
 2014-09-04 10:20:01,457 INFO
 org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
 hadoop-metrics2.properties
 2014-09-04 10:20:01,465 WARN
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error creating sink
 'file'
 org.apache.hadoop.metrics2.impl.MetricsConfigException: Error creating
 plugin: org.apache.hadoop.metrics2.sink.FileSink
 at
 org.apache.hadoop.metrics2.impl.MetricsConfig.getPlugin(MetricsConfig.java:203)
 at
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.newSink(MetricsSystemImpl.java:478)
 at
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configureSinks(MetricsSystemImpl.java:450)
 at
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:429)
 at
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:180)
 at
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:156)
 at
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:54)
 at
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1792)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1728)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1751)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1904)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1925)
 Caused by: org.apache.hadoop.metrics2.MetricsException: Error creating
 datanode-metrics.out
 at org.apache.hadoop.metrics2.sink.FileSink.init(FileSink.java:53)
 at
 org.apache.hadoop.metrics2.impl.MetricsConfig.getPlugin(MetricsConfig.java:199)
 ... 12 more
 Caused by: java.io.FileNotFoundException: datanode-metrics.out (Permission
 denied)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at java.io.FileWriter.init(FileWriter.java:107)
 at org.apache.hadoop.metrics2.sink.FileSink.init(FileSink.java:48)
 ... 13 more
 2014-09-04 10:20:01,488 INFO
 org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
 2014-09-04 10:20:01,546 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
 period at 5 second(s).
 2014-09-04 10:20:01,546 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system
 started
 2014-09-04 10:20:01,547 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is ch15
 2014-09-04 10:20:01,569 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at
 /0.0.0.0:50010
 2014-09-04 10:20:01,572 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is
 10485760 bytes/s
 2014-09-04 10:20:01,607 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2014-09-04 10:20:01,657 INFO org.apache.hadoop.http.HttpServer: Added
 global filter 'safety'
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
 2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added
 filter static_user_filter
 (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
 context datanode
 2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added
 filter static_user_filter
 (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
 context static
 2014-09-04 10:20:01,660 INFO org.apache.hadoop.http.HttpServer: Added
 filter static_user_filter
 (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
 context logs
 2014-09-04 10:20:01,664 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at
 0.0.0.0:50075
 

Re: question about matching java API with libHDFS

2014-09-04 Thread Demai Ni
hi, Yi A,

Thanks for your response. I took a look at hdfs.h and hdfs.c; it seems the
library only exposes some of the APIs, as there are a lot of other public
methods that can be accessed through the Java API/client but are not
implemented in libhdfs, such as the one I am using now:
DFSClient.getNamenode().getBlockLocations(...).

Is libhdfs designed to limit this access? Thanks

Demai


On Thu, Sep 4, 2014 at 2:36 AM, Liu, Yi A yi.a@intel.com wrote:

  You can refer to the header file “src/main/native/libhdfs/hdfs.h”;
 it lists the available APIs in detail.



 Regards,

 Yi Liu



 *From:* Demai Ni [mailto:nid...@gmail.com]
 *Sent:* Thursday, September 04, 2014 5:21 AM
 *To:* user@hadoop.apache.org
 *Subject:* question about matching java API with libHDFS



 hi, folks,

 I am currently using Java to access HDFS. For example, I am using the API
 DFSClient.getNamenode().getBlockLocations(...)... to retrieve file
 block information.

 Now I need to move the same logic into C/C++, so I am looking at libHDFS
 and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also
 using hdfs_test.c for some reference. However, I couldn't find a way to
 easily figure out whether the above Java API is exposed through libHDFS.

 Probably not, since I couldn't find it. That leads to my next question: is
 there an easy way to plug into the libHDFS framework to include additional
 APIs?

 thanks a lot for your suggestions

 Demai



[ANN] Multireducers - run multiple reducers on the same mapreduce job

2014-09-04 Thread Elazar Leibovich
I'd appreciate reviews of the code and the API of multireducers - a way to
run multiple map and reduce classes in the same MapReduce job.

Thanks,

https://github.com/elazarl/multireducers

Usage example:

MultiJob.create().
    withMapper(SelectFirstField.class, Text.class, IntWritable.class).
    withReducer(CountFirstField.class, 1).
    withCombiner(CountFirstField.class).
    withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
    addTo(job);

MultiJob.create().
    withMapper(SelectSecondField.class, IntWritableInRange.class, IntWritable.class).
    withReducer(CountSecondField.class, 1).
    withCombiner(CountSecondField.class).
    withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
    addTo(job);

Motivation:

Sometimes, one would like to run more than one MapReduce job on the same
input files.

A classic example: one would like to select two different fields from a
CSV file with two different mappers and count the distinct values for each
field.

Let's say we have a CSV file with employees' names and heights:

john,120
john,130
joe,180
moe,190
dough,130

We want one MapReduce job to count how many employees we have for each name
(two johns in our case), and also how many employees we have for each
height (two employees 130 cm tall).

The code for the mappers looks like

// i = 0 for the first reducer, 1 for the second
protected void map(LongWritable key, Text value, Context context) {
    context.write(new Text(value.toString().split(",")[i]), one);
}

The code for the reducers looks like

protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
    context.write(key, new IntWritable(Iterables.size(values)));
}
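
For context, here is a hypothetical driver skeleton around the calls above. The
surrounding Job setup is plain mapreduce API and is my assumption rather than
anything the library dictates, and the imports for MultiJob and for the mapper,
reducer and writable classes from the example are omitted since they live in
the project; check the project README for the exact requirements.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
// plus MultiJob and the mapper/reducer classes from the example above

public class EmployeeCounts {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "employee counts");
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the employees CSV
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory

        // Register both pipelines on the same job, exactly as in the example above.
        MultiJob.create().
            withMapper(SelectFirstField.class, Text.class, IntWritable.class).
            withReducer(CountFirstField.class, 1).
            withCombiner(CountFirstField.class).
            withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
            addTo(job);
        MultiJob.create().
            withMapper(SelectSecondField.class, IntWritableInRange.class, IntWritable.class).
            withReducer(CountSecondField.class, 1).
            withCombiner(CountSecondField.class).
            withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
            addTo(job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}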


Re: Need some tutorials for Mapreduce written in Python

2014-09-04 Thread Andrew Ehrlich
Also, when you look at examples, pay attention to the Hadoop version. The Java
API has changed a bit, which can be confusing.

On Aug 28, 2014, at 10:10 AM, Amar Singh amarsingh...@gmail.com wrote:

 Thank you to everyone who responded to this thread. I got a couple of good
 pointers and some good online courses to explore to get a
 fundamental understanding of things.
 
 Thanks
 Amar
 
 
 On Thu, Aug 28, 2014 at 10:15 AM, Sriram Balachander 
 sriram.balachan...@gmail.com wrote:
 Hadoop: The Definitive Guide and Hadoop in Action are good books, and the
 course on Edureka is also good.
 
 Regards
 Sriram
 
 
 On Wed, Aug 27, 2014 at 9:25 PM, thejas prasad thejch...@gmail.com wrote:
 Are there any books for this as well?
 
 
 
 On Wed, Aug 27, 2014 at 8:30 PM, Marco Shaw marco.s...@gmail.com wrote:
 You might want to consider the Hadoop course on udacity.com. I think it
 provides a decent foundation in Hadoop/MapReduce with a focus on Python
 (using the streaming API, as Sebastiano mentions).
 
 Marco
 
 
 On Wed, Aug 27, 2014 at 3:13 PM, Amar Singh amarsingh...@gmail.com wrote:
 Hi Users,
 I am new to the big data world and was in the process of reading some material
 on writing MapReduce jobs using Python.
 
 Any links or pointers in that direction would be really helpful.