NullPointerException when running multiple reducers with Hadoop 0.22.0-SNAPSHOT

2011-06-29 Thread Paolo Castagna

Hi,
I am using Apache Whirr to set up a Hadoop cluster on EC2 using Hadoop
0.22.0 SNAPSHOT (nightly) builds from Jenkins. For details, see [1,2].
(Is there a better place where I can get nightly builds of Hadoop?)

I have a Reducer which does not emit any (key, value) pairs; it generates
only side-effect files. When I run with only one reducer everything seems
fine. If I set up multiple reducers, each one generating different side-effect
files, I get a NullPointerException:

 WARN  Exception running child : java.lang.NullPointerException
   at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:96)
   at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:239)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:225)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1153)

   at org.apache.hadoop.mapred.Child.main(Child.java:217)

Have you ever seen this exception/stacktrace?

My driver is here [3] and the reducer is here [4].
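
For context, the reducer is roughly of the following shape -- an illustrative
sketch of a reducer that emits nothing and writes only side-effect files, not
the actual code linked in [4]; the class name and input types are assumptions:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: a reducer that never calls context.write() and only produces a
// per-task side-effect file in the task's work directory, so the output
// committer promotes it to the job output directory on success.
public class SideEffectReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

  private FSDataOutputStream out;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    Path sideFile = new Path(workDir, "side-" + context.getTaskAttemptID().getTaskID());
    out = sideFile.getFileSystem(context.getConfiguration()).create(sideFile, false);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      out.writeBytes(key + "\t" + value + "\n");  // side effect only, no context.write()
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
  }
}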

Regards,
Paolo

 [1] https://github.com/castagna/tdbloader3/blob/master/hadoop-ec2.properties
 [2] 
https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-22-Build/lastSuccessfulBuild/artifact/hadoop-0.22.0-SNAPSHOT.tar.gz
 [3] 
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/ThirdDriver.java
 [4] 
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/ThirdReducer.java




Re: extreme imbalance in the HDFS cluster

2011-06-29 Thread 茅旭峰
Thanks Edward! It seems like we'll just have to live with this issue.

On Wed, Jun 29, 2011 at 11:24 PM, Edward Capriolo wrote:

> We have run into this issue as well. Since Hadoop writes round-robin,
> different-sized disks really screw things up royally, especially if you are
> running at high capacity. We have found that decommissioning hosts for
> stretches of time is more effective than the balancer in extreme situations.
> Another hokey trick relies on the fact that the node that launches a job is
> always used as the first replica: you can leverage that by launching jobs
> from your bigger machines, which makes data more likely to be saved there. A
> super hokey solution is moving blocks around with rsync! (Block reports later
> happen and deal with this; I do not suggest it.)
>
> Hadoop really does need a more intelligent system than round-robin writing
> for heterogeneous systems; there might be a JIRA open on this somewhere. But
> if you are on 0.20.X you have to work with it.
>
> Edward
>
> On Wed, Jun 29, 2011 at 9:06 AM, 茅旭峰  wrote:
>
> > Hi,
> >
> > I'm running a 37-DN HDFS cluster. Twelve nodes have 20TB of capacity each,
> > and the other 25 nodes have 24TB each. Unfortunately, several nodes contain
> > much more data than the others, and I can still see their data growing like
> > crazy. The 'dstat' output shows
> >
> > dstat -ta 2
> > -time- total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
> >   date/time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
> > 24-06 00:42:43|  1   1  95   2   0   0|  25M   62M|   0 0 |   0   0.1 |3532  5644
> > 24-06 00:42:45|  7   1  91   0   0   0|  16k  176k|8346B 1447k|   0 0 |1201   365
> > 24-06 00:42:47|  7   1  91   0   0   0|  12k  172k|9577B 1493k|   0 0 |1223   334
> > 24-06 00:42:49| 11   3  83   1   0   1|  26M   11M|  78M   66M|   0 0 |  12k   18k
> > 24-06 00:42:51|  4   3  90   1   0   2|  17M  181M| 117M   53M|   0 0 |  15k   26k
> > 24-06 00:42:53|  4   3  87   4   0   2|  15M  375M| 117M   55M|   0 0 |  16k   26k
> > 24-06 00:42:55|  3   2  94   1   0   1|  15M   37M|  80M   17M|   0 0 |  10k   15k
> > 24-06 00:42:57|  0   0  98   1   0   0|  18M   23M|7259k 5988k|   0 0 |1932  1066
> > 24-06 00:42:59|  0   0  98   1   0   0|  16M  132M| 708k  106k|   0 0 |1484   491
> > 24-06 00:43:01|  4   2  91   2   0   1|  23M   64M|  76M   41M|   0 0 |844113k
> > 24-06 00:43:03|  4   3  88   3   0   1|  17M  207M|  91M   48M|   0 0 |  11k   16k
> >
> > From the dstat output, we can see that the write throughput is much higher
> > than the read throughput.
> > I've started a balancer process, with dfs.balance.bandwidthPerSec set to
> > bytes. From the balancer log, I can see the balancer works well, but the
> > balancing cannot catch up with the writes.
> >
> > Now I can only stop the mad increase in data size by stopping the datanode,
> > setting dfs.datanode.du.reserved to 300GB, and then starting the datanode
> > again. Once the used space reaches the 300GB reservation line, the increase
> > stops.
> >
> > The output of 'hadoop dfsadmin -report' shows for the crazy nodes,
> >
> > Name: 10.150.161.88:50010
> > Decommission Status : Normal
> > Configured Capacity: 20027709382656 (18.22 TB)
> > DFS Used: 14515387866480 (13.2 TB)
> > Non DFS Used: 0 (0 KB)
> > DFS Remaining: 5512321516176(5.01 TB)
> > DFS Used%: 72.48%
> > DFS Remaining%: 27.52%
> > Last contact: Wed Jun 29 21:03:01 CST 2011
> >
> >
> > Name: 10.150.161.76:50010
> > Decommission Status : Normal
> > Configured Capacity: 20027709382656 (18.22 TB)
> > DFS Used: 16554450730194 (15.06 TB)
> > Non DFS Used: 0 (0 KB)
> > DFS Remaining: 3473258652462(3.16 TB)
> > DFS Used%: 82.66%
> > DFS Remaining%: 17.34%
> > Last contact: Wed Jun 29 21:03:02 CST 2011
> >
> > while the other normal datanode, it just like
> >
> > Name: 10.150.161.65:50010
> > Decommission Status : Normal
> > Configured Capacity: 23627709382656 (21.49 TB)
> > DFS Used: 5953984552236 (5.42 TB)
> > Non DFS Used: 1200643810004 (1.09 TB)
> > DFS Remaining: 16473081020416(14.98 TB)
> > DFS Used%: 25.2%
> > DFS Remaining%: 69.72%
> > Last contact: Wed Jun 29 21:03:01 CST 2011
> >
> >
> > Name: 10.150.161.80:50010
> > Decommission Status : Normal
> > Configured Capacity: 23627709382656 (21.49 TB)
> > DFS Used: 5982565373592 (5.44 TB)
> > Non DFS Used: 1202701691240 (1.09 TB)
> > DFS Remaining: 16442442317824(14.95 TB)
> > DFS Used%: 25.32%
> > DFS Remaining%: 69.59%
> > Last contact: Wed Jun 29 21:03:02 CST 2011
> >
> > Any hint on this issue? We are using 0.20.2-cdh3u0.
> >
> > Thanks and regards,
> >
> > Mao Xu-Feng
> >
>


Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-29 Thread Matthew Foley
I agree with Matei.  Whether you will get good ROI on 10GigE depends very much 
on the types of jobs you run.
--Matt

On Jun 28, 2011, at 4:02 PM, Matei Zaharia wrote:

Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile 
your target Hadoop workload and see whether it's communication-bound. Hadoop 
jobs can definitely be communication-bound if you shuffle a lot of data between 
map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to 
decompression, running python, or just running expensive user code) or 
disk-IO-bound. You might be surprised at what your bottleneck is.

Matei

On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:

> Matt,
> Thanks, this is helpful, I was wondering if you may have some thoughts
> on the list of other potential benefits of 10GbE NICs for Hadoop
> (listed in my original e-mail to the list)?
> 
> regards,
> Saqib
> 
> -Original Message-
> From: Matthew Foley [mailto:ma...@yahoo-inc.com] 
> Sent: Tuesday, June 28, 2011 12:04 PM
> To: common-user@hadoop.apache.org
> Cc: Matthew Foley
> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
> 
> Hadoop common provides an abstract FileSystem class, and Hadoop applications
> should be designed to run on that.  HDFS is just one implementation of a
> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
> LocalFileSystem are provided in Hadoop common.  Use of NFS-mounted storage
> would fall under the LocalFileSystem model.
> 
> However, one of the core values of Hadoop is the model of "bring the
> computation to the data".  This does not seem viable with an NFS-based
> NAS-model storage subsystem.  Thus, while it will "work" for small clusters
> and small jobs, it is unlikely to scale with high performance to thousands
> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
> 
> --Matt
> 
> 
> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
> 
> I see. However, Hadoop is designed to operate best with HDFS because of its
> inherent striping and blocking strategy - which is tracked by Hadoop.
> Going outside of that mechanism will probably yield poor results and/or
> confuse Hadoop.
> 
> Just my thoughts.
> 
> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>> Darren,
>> Thanks, the last point was basically about 10GbE potentially allowing the 
>> use of a network file system, e.g. via NFS, as an alternative to HDFS; the 
>> question is whether there is any merit in this. Basically, I was exploring 
>> whether the commercial clustered NAS products offer any high-availability or 
>> data-management benefits for use with Hadoop?
>> 
>> Saqib
>> 
>> -Original Message-
>> From: Darren Govoni [mailto:dar...@ontrenet.com]
>> Sent: Tuesday, June 28, 2011 10:21 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>> 
>> Hadoop, like other parallel networked computation architectures, is 
>> predominantly I/O bound.
>> This means any increase in network bandwidth is "A Good Thing" and can 
>> have drastic positive effects on performance. All your points stem 
>> from this simple realization.
>> 
>> Although I'm confused by your #6: Hadoop already uses a distributed 
>> file system, HDFS.
>> 
>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>>> Folks,
>>> 
>>> I've been digging into the potential benefits of using 10 Gigabit Ethernet
>>> (10GbE) NIC server connections for Hadoop and wanted to run what I've come
>>> up with through initial research by the list for 'sanity check' feedback.
>>> I'd very much appreciate your input on the importance (or lack of it) of
>>> the following potential benefits of 10GbE server connectivity, as well as
>>> other thoughts regarding 10GbE and Hadoop. (My interest is specifically in
>>> the value of 10GbE server connections and 10GbE switching infrastructure,
>>> over scenarios such as bonded 1GbE server connections with 10GbE switching.)
>>> 
>>> 1.   HDFS Data Loading. The higher throughput enabled by 10GbE server and
>>> switching infrastructure allows faster processing and distribution of data.
>>> 
>>> 2.   Hadoop Cluster Scalability. High performance for initial data
>>> processing and distribution directly impacts the degree of parallelism or
>>> scalability supported by the cluster.
>>> 
>>> 3.   HDFS Replication. Higher-speed server connections allow faster file
>>> replication.
>>> 
>>> 4.   Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency
>>> directly impact the shuffle phase of a data set reduction, especially for
>>> tasks at the document level (including large documents) with lots of
>>> metadata generated by those documents, as well as video analytics and
>>> images.
>>> 
>>> 5.   Data Reporting. 10GbE server network performance can
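
For reference, the abstract FileSystem usage Matt describes above looks
roughly like this (a minimal sketch; the URI and paths are illustrative, and
the scheme selects the implementation):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: code written against the abstract FileSystem class runs unchanged on
// HDFS, the local filesystem, S3, etc.; the URI scheme selects the implementation.
public class ListAnyFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. hdfs://namenode:8020/data, file:///tmp/data, s3n://bucket/data
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}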


Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-29 Thread Michel Segel
I'm not sure which point you are trying to make.
To answer your question...

With respect to price... 10GbE is cost effective.
You have to consider that 1GbE is not only your port speed; there is also the 
speed of the uplink or trunk.

So if you continue to build out, you run into bandwidth issues between racks. 
So you end up doing 1GbE ports and then higher speed for the uplinks only, via 
port bonding or bigger-bandwidth links. These switches are more expensive than 
simple 1GbE switches, but less than full 10GbE.

Depending on vendor, number of ports, and discount, you can get the switch for 
approx $10,000 and up. Think $550 to $600 a port for 10GbE.

With Sandy Bridge, you will start to see 10GbE on the motherboards.

If you're following the discussion on performance gains, you'll start to see 
the network becoming the bottleneck.

If you are planning to build a new cluster... you should plan on 10GbE.







Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 29, 2011, at 1:07 AM, Bharath Mundlapudi  wrote:
> One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly 
> having extra bandwidth is good, but at what price?
> 
> 
> Please note that all the points you mentioned can work with 1Gb NICs today, 
> unless you can back them up with price/performance data. Typically, map output 
> is compressed; if the system is hitting peak network utilization, one can 
> select higher-compression algorithms at the cost of CPU. Most of these 
> machines come with dual NIC cards, so one could do link bonding to push more bits.
> 
> 
> One area that may see real benefit from 10Gb NICs is high-density systems - 24 cores 
> and 3x12TB disks. This is the trend now and will continue. These systems can 
> saturate 1Gb NICs. 
> 
> 
> -Bharath
> 
> 
> 
> 
> From: Saqib Jang -- Margalla Communications 
> To: common-user@hadoop.apache.org
> Sent: Tuesday, June 28, 2011 10:16 AM
> Subject: Sanity check re: value of 10GbE NICs for Hadoop?
> 
> Folks,
> 
> I've been digging into the potential benefits of using 10 Gigabit Ethernet
> (10GbE) NIC server connections for Hadoop and wanted to run what I've come up
> with through initial research by the list for 'sanity check' feedback. I'd
> very much appreciate your input on the importance (or lack of it) of the
> following potential benefits of 10GbE server connectivity, as well as other
> thoughts regarding 10GbE and Hadoop. (My interest is specifically in the
> value of 10GbE server connections and 10GbE switching infrastructure, over
> scenarios such as bonded 1GbE server connections with 10GbE switching.)
> 
> 1.   HDFS Data Loading. The higher throughput enabled by 10GbE server and
> switching infrastructure allows faster processing and distribution of data.
> 
> 2.   Hadoop Cluster Scalability. High performance for initial data processing
> and distribution directly impacts the degree of parallelism or scalability
> supported by the cluster.
> 
> 3.   HDFS Replication. Higher-speed server connections allow faster file
> replication.
> 
> 4.   Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency
> directly impact the shuffle phase of a data set reduction, especially for
> tasks at the document level (including large documents) with lots of metadata
> generated by those documents, as well as video analytics and images.
> 
> 5.   Data Reporting. 10GbE server network performance can improve data
> reporting performance, especially if the Hadoop cluster is running multiple
> data reductions.
> 
> 6.   Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
> reorganized to use a cluster or network file system. This would allow Hadoop,
> even with its Java implementation, to have higher-performance I/O and not
> have to be so concerned with disk drive density in the same server.
> 
> 7.   Others?
> 
> thanks,
> 
> Saqib
> 
> 
> Saqib Jang
> Principal/Founder
> Margalla Communications, Inc.
> 1339 Portola Road, Woodside, CA 94062
> (650) 274 8745
> www.margallacomm.com
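
For reference, the map-output compression mentioned above can be enabled
roughly like this (a minimal sketch; the property names shown are the 0.21-era
ones, the 0.20.x equivalents are noted in the comments, and the Gzip codec is
only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

// Sketch: trade CPU for shuffle bandwidth by compressing intermediate map output.
public class ShuffleCompressionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 0.21+ property names; on 0.20.x the equivalents are
    // mapred.compress.map.output and mapred.map.output.compression.codec.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  GzipCodec.class, CompressionCodec.class);
    System.out.println(conf.get("mapreduce.map.output.compress.codec"));
  }
}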


Question about data sorting on Hadoop

2011-06-29 Thread Jingwei Lu
Hi Everyone:

I launched two experiments, sorting 1 Gb and 10 Gb of data with Hadoop, on
(1) a single machine and (2) a 5-node cluster in a LAN.

The command is:

bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>]
<in-dir> <out-dir>

the result is shown here:

[image: image.png]

Mapping shows good scalability. The thing is, reduce takes much longer than
expected on the cluster.
As far as I know, the Hadoop sort uses the identity function for reduce, which
simply writes the map output to a file. I tested the LAN bandwidth, which is
~100 Mbps, and the average LAN traffic during reduce is about 10 Mbps (for
both sending and receiving).
As a result, this looks a bit weird to me...
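
As far as I understand it, an identity reduce amounts to the default Reducer,
roughly the sketch below (illustrative; the class name is made up). Note that
the reduce tasks still have to fetch and merge all the map output over the
network before this pass-through runs, which is typically where the time goes:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;

// Sketch of an identity reduce: every (key, value) pair is passed through
// unchanged; the expensive part is the shuffle/merge that feeds it.
public class IdentityReduceSketch<K, V> extends Reducer<K, V, K, V> {
  @Override
  protected void reduce(K key, Iterable<V> values, Context context)
      throws IOException, InterruptedException {
    for (V value : values) {
      context.write(key, value);
    }
  }
}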

I am quite new to Hadoop, so forgive me for any stupid questions here...

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu


Re: extreme imbalance in the HDFS cluster

2011-06-29 Thread Edward Capriolo
We have run into this issue as well. Since Hadoop writes round-robin,
different-sized disks really screw things up royally, especially if you are
running at high capacity. We have found that decommissioning hosts for
stretches of time is more effective than the balancer in extreme situations.
Another hokey trick relies on the fact that the node that launches a job is
always used as the first replica: you can leverage that by launching jobs from
your bigger machines, which makes data more likely to be saved there. A super
hokey solution is moving blocks around with rsync! (Block reports later happen
and deal with this; I do not suggest it.)

Hadoop really does need a more intelligent system than round-robin writing
for heterogeneous systems; there might be a JIRA open on this somewhere. But
if you are on 0.20.X you have to work with it.

Edward

On Wed, Jun 29, 2011 at 9:06 AM, 茅旭峰  wrote:

> Hi,
>
> I'm running a 37-DN HDFS cluster. Twelve nodes have 20TB of capacity each,
> and the other 25 nodes have 24TB each. Unfortunately, several nodes contain
> much more data than the others, and I can still see their data growing like
> crazy. The 'dstat' output shows
>
> dstat -ta 2
> -time- total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
>   date/time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
> 24-06 00:42:43|  1   1  95   2   0   0|  25M   62M|   0 0 |   0   0.1 |3532  5644
> 24-06 00:42:45|  7   1  91   0   0   0|  16k  176k|8346B 1447k|   0 0 |1201   365
> 24-06 00:42:47|  7   1  91   0   0   0|  12k  172k|9577B 1493k|   0 0 |1223   334
> 24-06 00:42:49| 11   3  83   1   0   1|  26M   11M|  78M   66M|   0 0 |  12k   18k
> 24-06 00:42:51|  4   3  90   1   0   2|  17M  181M| 117M   53M|   0 0 |  15k   26k
> 24-06 00:42:53|  4   3  87   4   0   2|  15M  375M| 117M   55M|   0 0 |  16k   26k
> 24-06 00:42:55|  3   2  94   1   0   1|  15M   37M|  80M   17M|   0 0 |  10k   15k
> 24-06 00:42:57|  0   0  98   1   0   0|  18M   23M|7259k 5988k|   0 0 |1932  1066
> 24-06 00:42:59|  0   0  98   1   0   0|  16M  132M| 708k  106k|   0 0 |1484   491
> 24-06 00:43:01|  4   2  91   2   0   1|  23M   64M|  76M   41M|   0 0 |844113k
> 24-06 00:43:03|  4   3  88   3   0   1|  17M  207M|  91M   48M|   0 0 |  11k   16k
>
> From the dstat output, we can see that the write throughput is much higher
> than the read throughput.
> I've started a balancer process, with dfs.balance.bandwidthPerSec set to
> bytes. From the balancer log, I can see the balancer works well, but the
> balancing cannot catch up with the writes.
>
> Now I can only stop the mad increase in data size by stopping the datanode,
> setting dfs.datanode.du.reserved to 300GB, and then starting the datanode
> again. Once the used space reaches the 300GB reservation line, the increase
> stops.
>
> The output of 'hadoop dfsadmin -report' shows for the crazy nodes,
>
> Name: 10.150.161.88:50010
> Decommission Status : Normal
> Configured Capacity: 20027709382656 (18.22 TB)
> DFS Used: 14515387866480 (13.2 TB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 5512321516176(5.01 TB)
> DFS Used%: 72.48%
> DFS Remaining%: 27.52%
> Last contact: Wed Jun 29 21:03:01 CST 2011
>
>
> Name: 10.150.161.76:50010
> Decommission Status : Normal
> Configured Capacity: 20027709382656 (18.22 TB)
> DFS Used: 16554450730194 (15.06 TB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 3473258652462(3.16 TB)
> DFS Used%: 82.66%
> DFS Remaining%: 17.34%
> Last contact: Wed Jun 29 21:03:02 CST 2011
>
> while the other normal datanode, it just like
>
> Name: 10.150.161.65:50010
> Decommission Status : Normal
> Configured Capacity: 23627709382656 (21.49 TB)
> DFS Used: 5953984552236 (5.42 TB)
> Non DFS Used: 1200643810004 (1.09 TB)
> DFS Remaining: 16473081020416(14.98 TB)
> DFS Used%: 25.2%
> DFS Remaining%: 69.72%
> Last contact: Wed Jun 29 21:03:01 CST 2011
>
>
> Name: 10.150.161.80:50010
> Decommission Status : Normal
> Configured Capacity: 23627709382656 (21.49 TB)
> DFS Used: 5982565373592 (5.44 TB)
> Non DFS Used: 1202701691240 (1.09 TB)
> DFS Remaining: 16442442317824(14.95 TB)
> DFS Used%: 25.32%
> DFS Remaining%: 69.59%
> Last contact: Wed Jun 29 21:03:02 CST 2011
>
> Any hint on this issue? We are using 0.20.2-cdh3u0.
>
> Thanks and regards,
>
> Mao Xu-Feng
>


extreme imbalance in the HDFS cluster

2011-06-29 Thread 茅旭峰
Hi,

I'm running a 37-DN HDFS cluster. Twelve nodes have 20TB of capacity each, and
the other 25 nodes have 24TB each. Unfortunately, several nodes contain much
more data than the others, and I can still see their data growing like crazy.
The 'dstat' output shows

dstat -ta 2
-time- total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
  date/time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
24-06 00:42:43|  1   1  95   2   0   0|  25M   62M|   0 0 |   0   0.1 |3532  5644
24-06 00:42:45|  7   1  91   0   0   0|  16k  176k|8346B 1447k|   0 0 |1201   365
24-06 00:42:47|  7   1  91   0   0   0|  12k  172k|9577B 1493k|   0 0 |1223   334
24-06 00:42:49| 11   3  83   1   0   1|  26M   11M|  78M   66M|   0 0 |  12k   18k
24-06 00:42:51|  4   3  90   1   0   2|  17M  181M| 117M   53M|   0 0 |  15k   26k
24-06 00:42:53|  4   3  87   4   0   2|  15M  375M| 117M   55M|   0 0 |  16k   26k
24-06 00:42:55|  3   2  94   1   0   1|  15M   37M|  80M   17M|   0 0 |  10k   15k
24-06 00:42:57|  0   0  98   1   0   0|  18M   23M|7259k 5988k|   0 0 |1932  1066
24-06 00:42:59|  0   0  98   1   0   0|  16M  132M| 708k  106k|   0 0 |1484   491
24-06 00:43:01|  4   2  91   2   0   1|  23M   64M|  76M   41M|   0 0 |844113k
24-06 00:43:03|  4   3  88   3   0   1|  17M  207M|  91M   48M|   0 0 |  11k   16k

From the dstat output, we can see that the write throughput is much higher
than the read throughput.
I've started a balancer process, with dfs.balance.bandwidthPerSec set to
bytes. From the balancer log, I can see the balancer works well, but the
balancing cannot catch up with the writes.

Now I can only stop the mad increase in data size by stopping the datanode,
setting dfs.datanode.du.reserved to 300GB, and then starting the datanode
again. Once the used space reaches the 300GB reservation line, the increase
stops.
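
For reference, both properties mentioned above take long values in bytes in
hdfs-site.xml on the datanodes; a minimal sketch of equivalent settings
follows (the balancer bandwidth number is only an example, not the value
actually used here):

import org.apache.hadoop.conf.Configuration;

// Sketch only: these properties normally live in hdfs-site.xml on each
// datanode; both are long values in bytes.
public class HdfsTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 300 GB reserved for non-HDFS use, per volume
    conf.setLong("dfs.datanode.du.reserved", 300L * 1024 * 1024 * 1024);
    // balancer bandwidth cap in bytes per second per datanode (example: 10 MB/s)
    conf.setLong("dfs.balance.bandwidthPerSec", 10L * 1024 * 1024);
    System.out.println(conf.get("dfs.datanode.du.reserved"));
    System.out.println(conf.get("dfs.balance.bandwidthPerSec"));
  }
}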

The output of 'hadoop dfsadmin -report' shows for the crazy nodes,

Name: 10.150.161.88:50010
Decommission Status : Normal
Configured Capacity: 20027709382656 (18.22 TB)
DFS Used: 14515387866480 (13.2 TB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 5512321516176(5.01 TB)
DFS Used%: 72.48%
DFS Remaining%: 27.52%
Last contact: Wed Jun 29 21:03:01 CST 2011


Name: 10.150.161.76:50010
Decommission Status : Normal
Configured Capacity: 20027709382656 (18.22 TB)
DFS Used: 16554450730194 (15.06 TB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 3473258652462(3.16 TB)
DFS Used%: 82.66%
DFS Remaining%: 17.34%
Last contact: Wed Jun 29 21:03:02 CST 2011

while the other normal datanode, it just like

Name: 10.150.161.65:50010
Decommission Status : Normal
Configured Capacity: 23627709382656 (21.49 TB)
DFS Used: 5953984552236 (5.42 TB)
Non DFS Used: 1200643810004 (1.09 TB)
DFS Remaining: 16473081020416(14.98 TB)
DFS Used%: 25.2%
DFS Remaining%: 69.72%
Last contact: Wed Jun 29 21:03:01 CST 2011


Name: 10.150.161.80:50010
Decommission Status : Normal
Configured Capacity: 23627709382656 (21.49 TB)
DFS Used: 5982565373592 (5.44 TB)
Non DFS Used: 1202701691240 (1.09 TB)
DFS Remaining: 16442442317824(14.95 TB)
DFS Used%: 25.32%
DFS Remaining%: 69.59%
Last contact: Wed Jun 29 21:03:02 CST 2011

Any hint on this issue? We are using 0.20.2-cdh3u0.

Thanks and regards,

Mao Xu-Feng


RE: conferences

2011-06-29 Thread Jeff.Schmitz
http://developer.yahoo.com/events/hadoopsummit2011/

There will also be a lot about Hadoop at OSCON

http://www.oscon.com/oscon2011

I believe Hadoop World is in NYC in November.

-Original Message-
From: Keren Ouaknine [mailto:ker...@gmail.com] 
Sent: Wednesday, June 29, 2011 6:34 AM
To: common-user@hadoop.apache.org
Subject: conferences

Hello,

I would like to find the list of prestigious conferences related to
Hadoop.
Where can I find the list of these? Thanks!

Keren

-- 
Keren Ouaknine
Cell: +972 54 2565404
Web: www.kereno.com



Re: conferences

2011-06-29 Thread Eric Charles

On 29/06/11 13:33, Keren Ouaknine wrote:

Hello,

I would like to find the list of prestigious conferences related to Hadoop.
Where can I find the list of these? Thanks!

Keren



Hi,

You can try http://wiki.apache.org/hadoop/Conferences

I was just surfing this morning on:
http://developer.yahoo.com/events/hadoopsummit2011/
http://www.cloudera.com/company/events/hadoop-world-2011/

Thx
--
Eric


conferences

2011-06-29 Thread Keren Ouaknine
Hello,

I would like to find the list of prestigious conferences related to Hadoop.
Where can I find the list of these? Thanks!

Keren

-- 
Keren Ouaknine
Cell: +972 54 2565404
Web: www.kereno.com


Re: ReflectionUtils.setConf would configure anything Configurable twice?

2011-06-29 Thread steven zhuang
Anyone have the same partition problem using Streaming in 0.21.0?
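
To make the failure mode concrete, here is a minimal, hypothetical Configurable
whose state gets duplicated when setConf() is applied twice (illustrative only;
the class and property name are made up, and the real KeyFieldBasedPartitioner
code is more involved):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// Hypothetical example: any Configurable that accumulates state in setConf()
// ends up with that state twice when it is configured twice, as described in
// the quoted report below.
public class KeySpecHolder implements Configurable {

  private Configuration conf;
  private final List<String> keySpecs = new ArrayList<String>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // appends instead of replacing, so a second call duplicates every spec
    keySpecs.add(conf.get("example.partitioner.keyspec", "-k1,1"));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  public int numKeySpecs() {
    return keySpecs.size();  // 2 after being configured twice
  }
}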



On Fri, Jun 24, 2011 at 10:22 AM, steven zhuang wrote:

> Hello, list,
>
>   Recently I upgraded our Hadoop cluster from 0.20.2 to 0.21.0, and I
> found something wrong introduced by the upgrade.
>
>   In the setConf method of org.apache.hadoop.util.ReflectionUtils, any
> instance of Configurable is configured twice, and this may cause trouble.
>   For example, in 0.21.0, KeyFieldBasedPartitioner implements the
> Configurable interface. When configured twice, it gets two key descriptions
> and gives out wrong partition numbers.
>
> I have created a ticket for this:
> https://issues.apache.org/jira/browse/HADOOP-7425
>
>
> Paste the source code in 0.21.0 (and 0.20.2 too) below:
>
> public static void setConf(Object theObject, Configuration conf) {
>   if (conf != null) {
>     if (theObject instanceof Configurable) {
>       ((Configurable) theObject).setConf(conf);
>     }
>     setJobConf(theObject, conf);
>   }
> }
>
>
>
> --
> best wishes.
>  steven
>
>


-- 
best wishes.
 steven


Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-29 Thread Paul Ingles
Hi,

I'm not familiar with Wukong, but Mandy has some scripts that wrap the hadoop 
commands - the default behaviour, IIRC, is to package the folder the script is in.

This is then distributed so the app carries all its dependencies with it.

Happy to hear -files works for you.

Sent from my iPhone

On 29 Jun 2011, at 07:44, Guang-Nan Cheng  wrote:

> Well, my bad. I made a simple test and confirmed that  -files works that way
> already.
> 
> For the two guys who "answered" my question, sorry I asked the question
> unclearly... I don't see how those two projects relate to the question,
> but thank you. :D
> 
> 
> 
> 
> On Wed, Jun 29, 2011 at 12:35 AM, Abhinay Mehta 
> wrote:
> 
>> We use Mandy: https://github.com/forward/mandy for this.
>> 
>> 
>> On 28 June 2011 17:26, Nick Jones  wrote:
>> 
>>> Take a look at Wukong from the guys at Infochimps:
>>> https://github.com/mrflip/wukong 
>>> 
>>> 
>>> On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:
>>> 
 I'm keen on passing a whole ruby app to streaming, so I don't need to
 bother with ruby file dependencies.
 
 For example,
 
 ./streaming
 
 ...
 -mapper 'ruby aaa/bbb/ccc'
 -files  aaa<--- pass the folder
 
 
 
 
 Is this supported already? If not, any tips on how to make this work?
>> I'm
 willing to add some code by myself and rebuild the streaming jar.
 
>>> 
>>> --
>>> Nick Jones
>>> 
>>> 
>>> 
>>