Setting Number of Maps in 0.20.2

2011-08-14 Thread Jason Reed
I am trying to figure out how to set the number of maps to use in 0.20.2.

If I were using JobConf in my program, I could use:

conf.setNumMapTasks(numMaps);

However, JobConf and that method are deprecated, so when we started our
project we structured everything around Configuration and Job. Is there a
way to set the number of map tasks using Job and Configuration?
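
For reference, the closest equivalent I have found is to set the underlying property directly on the Job's Configuration. This is only a sketch (it assumes "mapred.map.tasks" is still honored as a hint in 0.20.2; as far as I understand, the actual number of maps is ultimately derived from the InputFormat's splits):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapCountExample {
    public static void main(String[] args) throws Exception {
        int numMaps = 10;  // hypothetical value, for illustration only

        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-count-example");

        // Rough equivalent of the deprecated JobConf.setNumMapTasks(numMaps):
        // this sets the same "mapred.map.tasks" property, which the framework
        // treats as a hint; the real map count comes from the input splits.
        job.getConfiguration().setInt("mapred.map.tasks", numMaps);

        // The number of reduce tasks, by contrast, can be set exactly.
        job.setNumReduceTasks(2);
    }
}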

Thank you for any help,
Jason


Re: Question about RAID controllers and hadoop

2011-08-14 Thread Charles Wimmer
[cwimmer@hostname bonnie++-1.03e]$ ./bonnie++ -d . -s 5 -m P410 -f
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e       --Sequential Output--         --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
P410             5M           68392  12 21153   3           116423  4 216.8   0
                    --Sequential Create--         ----Random Create----
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22238  33     + +++     + +++     + +++     + +++     + +++


P410,5M,,,68392,12,21153,3,,,116423,4,216.8,0,16,22238,33,+,+++,+,+++,+,+++,+,+++,+,+++



On 8/11/11 5:15 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer cwim...@yahoo-inc.com wrote:
 We currently use P410s in a 12-disk system.  Each disk is set up as a RAID0 
 volume.  Performance is at least as good as a bare disk.

Can you please share what throughput you see with P410s? Are these SATA or SAS?



 On 8/11/11 3:23 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com 
 wrote:

 If I read that email chain correctly, they were referring to the classic 
 JBOD vs. multiple-disks-striped-together debate. The conversation started 
 here is about JBOD vs. one RAID 0 volume per disk, and the effect of the 
 RAID controller on those independent RAID volumes.

 Matt

 -Original Message-
 From: Kai Voigt [mailto:k...@123.org]
 Sent: Thursday, August 11, 2011 5:17 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Question about RAID controllers and hadoop

 Yahoo did some testing 2 years ago: 
 http://markmail.org/message/xmzc45zi25htr7ry

 But an updated benchmark would be interesting to see.

 Kai

 Am 12.08.2011 um 00:13 schrieb GOEKE, MATTHEW (AG/1000):

 My assumption would be that a set of 4 single-disk RAID 0 volumes would actually be 
 better than a controller that allowed pure JBOD of 4 disks, because of the 
 cache on the controller. If anyone has personal experience with this I 
 would love to see performance numbers, but our infrastructure guy is running 
 tests on exactly this over the next couple of days, so I will pass the results 
 along once we have them.

 Matt

 -Original Message-
 From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com]
 Sent: Thursday, August 11, 2011 5:00 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Question about RAID controllers and hadoop

 True, you need a P410 controller. You can create a single-disk RAID0 volume for 
 each disk to make it behave like JBOD.
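
 A minimal sketch of how those per-disk volumes are then handed to Hadoop, assuming 
 the single-disk RAID0 logical drives end up mounted at /data/1 through /data/4 (the 
 mount points are hypothetical; the property names are the standard 0.20 ones):

 <!-- hdfs-site.xml: one DataNode storage directory per disk -->
 <property>
   <name>dfs.data.dir</name>
   <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
 </property>

 <!-- mapred-site.xml: spread intermediate map output across the same disks -->
 <property>
   <name>mapred.local.dir</name>
   <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
 </property>

 HDFS and MapReduce rotate across the listed directories, which gives you 
 JBOD-style behavior even though each disk is technically its own RAID0 volume.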


 -Bharath



 
 From: Koert Kuipers ko...@tresata.com
 To: common-user@hadoop.apache.org
 Sent: Thursday, August 11, 2011 2:50 PM
 Subject: Question about RAID controllers and hadoop

 Hello all,
 We are considering using low-end HP ProLiant machines (DL160s and DL180s)
 for cluster nodes. However, with these machines, if you want more than 4
 hard drives, HP puts in a P410 RAID controller. We would configure the
 RAID controller to function as JBOD by simply creating multiple RAID
 volumes with one disk each. Does anyone have experience with this setup? Is it a
 good idea, or am I introducing an I/O bottleneck?
 Thanks for your help!
 Best, Koert



 --
 Kai Voigt

Is Hadoop suitable for my data distribution problem?

2011-08-14 Thread Thiago Moraes
Hey guys,

I'm new to the list and I'm currently considering Hadoop to solve a data
distribution problem. Right now, there's a server which contains very large
files (typical files are 30GB or more). This server is accessed over the LAN
and over the internet but, of course, doing so is difficult without a
local connection.

My idea to solve this problem is to deploy new servers at the places that
access the data most often, in such a way that each gets a local copy of the
files it accesses most. These new servers would download and store
parts of the data (entire files) so that the data can be accessed through their
own LAN alone, without needing to rely on another server's data. Is it
possible to impose this kind of constraint when distributing a file across
Hadoop's nodes?

In fact, I don't even know if this restriction is useful. In my head,
enforcing this kind of data locality would make it possible to use the data
internally even when there is no internet connection, at the price of limiting
the number of nodes available for load balancing and replication. Is
this tradeoff acceptable, or at least possible, with Hadoop?

thanks,

Thiago Moraes - EnC 07 - UFSCar


Re: hdfs format command issue

2011-08-14 Thread Giridharan Kesavan
this should help.

echo Y | ${hadoophdfshome}/bin/hdfs namenode -format
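
In the bring-up script, the whole step might look something like this (a sketch only; the namenode_dir variable and directory layout are placeholders for whatever your script uses, and in my experience the prompt expects a capital "Y"):

# Format the namenode only if it has not been formatted before,
# answering the confirmation prompt non-interactively.
if [ ! -d "${namenode_dir}/current" ]; then
  echo Y | "${hadoophdfshome}/bin/hdfs" namenode -format
fi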

-giri

On Sat, Aug 13, 2011 at 8:41 AM, Dhodapkar, Chinmay
chinm...@qualcomm.comwrote:

 I am trying to automate the installation/bringup of a complete hadoop/hbase
 cluster from a single script. I have run into a very small issue...
 Before bringing up the namenode, I have to format it with the usual hadoop
 namenode -format

 Executing the above command prompts the user for Y/N. Is there an option
 that can be passed to force the format without prompting?
 The aim is for the script to complete without any human intervention...






WritableComparable

2011-08-14 Thread Stan Rosenberg
Hi Folks,

After much poking around I am still unable to determine why I am seeing
'reduce' being called twice with the same key.
Recall from my previous email that sameness is determined by 'compareTo'
of my custom key type.

AFAIK, the default WritableComparator invokes 'compareTo' for any two keys
which are being ordered during sorting and merging.
Is it somehow possible that a bitwise comparator is used for the spilled map
output rather than the default WritableComparator?

I am out of clues, short of studying the shuffling code.  If anyone can
suggest some further debugging steps, don't be shy. :)
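
One check I am considering, to rule that out, is pinning the sort and grouping
comparators explicitly so that the framework has to deserialize the keys and go
through compareTo. A sketch only (Key is my custom key type; KeyComparator is
just a name I made up):

import org.apache.hadoop.io.WritableComparator;

public class KeyComparator extends WritableComparator {
    public KeyComparator() {
        // Passing 'true' makes WritableComparator allocate Key instances,
        // so its byte-level compare() deserializes both keys and then
        // delegates to Key.compareTo().
        super(Key.class, true);
    }
}

// During job setup:
//   job.setSortComparatorClass(KeyComparator.class);
//   job.setGroupingComparatorClass(KeyComparator.class);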

Thanks!!!

stan


Re: WritableComparable

2011-08-14 Thread Joey Echeverria
Does your compareTo() method test object pointer equality? If so, you could
be getting burned by Hadoop reusing Writable objects.
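
For example, something along these lines (Key and its 'id' field are stand-ins,
not your actual code) goes wrong once Hadoop starts recycling instances, while
a value comparison stays correct:

public int compareTo(Key o) {
    // BAD: relies on reference identity as the equality test. With object
    // reuse, two logically equal keys can be distinct objects (and a reused
    // object can hold a different logical key than it did a moment ago).
    //   return (this.id == o.id) ? 0 : 1;

    // GOOD: compare by value.
    return this.id.compareTo(o.id);
}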

-Joey
On Aug 14, 2011 9:20 PM, Stan Rosenberg srosenb...@proclivitysystems.com
wrote:
 Hi Folks,

 After much poking around I am still unable to determine why I am seeing
 'reduce' being called twice with the same key.
 Recall from my previous email that sameness is determined by 'compareTo'
 of my custom key type.

 AFAIK, the default WritableComparator invokes 'compareTo' for any two keys
 which are being ordered during sorting and merging.
 Is it somehow possible that a bitwise comparator is used for the spilled
map
 output rather than the default WritableComparator?

 I am out of clues, short of studying the shuffling code. If anyone can
 suggest some further debugging steps, don't be shy. :)

 Thanks!!!

 stan


NPE in TaskLogAppender

2011-08-14 Thread aaron morton
I'm running the Cassandra Brisk server with Hadoop core 0.20.203 on OS X; 
everything is local. 

I keep running into this problem for Hive jobs

 INFO 13:52:39,923 Error from attempt_201108151342_0001_m_01_1: 
java.lang.NullPointerException
at 
org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:264)
at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)

The only info I've found online was 
http://www.mail-archive.com/common-user@hadoop.apache.org/msg12829.html

Just for fun I tried…
* setting mapred.acls.enabled to true 
* setting mapred.queue.default.acl-submit-job and 
mapred.queue.default.acl-administer-jobs to *

There was no discernible increase in joy though. 

 Any thoughts ? 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com



Re: WritableComparable

2011-08-14 Thread Stan Rosenberg
On Sun, Aug 14, 2011 at 9:33 PM, Joey Echeverria j...@cloudera.com wrote:

 Does your compareTo() method test object pointer equality? If so, you could
 be getting burned by Hadoop reusing Writable objects.


Yes, but only for equality between enum values.  Interestingly, when
'reduce' is called there are three instances of the same key.
Two instances are correctly merged, and they both come from the same mapper.
 The other instance comes from a different mapper and for
some reason does not get merged.  I see the key and the values
(corresponding to the two merged instances) passed as arguments
to 'reduce'; then in a subsequent 'reduce' call I see the key and the value
corresponding to the third instance.

For completeness, here is my 'Key.compareTo':

public int compareTo(Key o) {
    if (this.type != o.type) {
        // Type.X < Type.Y
        return (this.type == Type.X ? -1 : 1);
    }
    // otherwise, delegate
    if (this.type == Type.X) {
        return this.key1.compareTo(o.key1);
    } else {
        return this.key2.compareTo(o.key2);
    }
}

The 'type' field is an enum with two possible values, say X and Y.  Key is
essentially a union type; i.e., at any given time
it's the values in key1 or key2 that are being compared (depending on the
'type' value).


Re: WritableComparable

2011-08-14 Thread Stan Rosenberg
On Sun, Aug 14, 2011 at 10:25 PM, Joey Echeverria j...@cloudera.com wrote:

 What are the types of key1 and key2? What does the readFields() method
 look like?


The type of key1 is essentially a wrapper for java.util.UUID.
Here is its readFields:

public void readFields(DataInput in) throws IOException {
  id = new UUID(in.readLong(), in.readLong());
}

So, it reconstitutes the UUID by deserializing two longs.  The 'compareTo'
method of this key type delegates to java.util.UUID.compareTo.

The type of key2 wraps a different id, one that fits into a long.  In
addition to an id, it also stores an enum which designates the source of
this id.
Here is its readFields:

public void readFields(DataInput in) throws IOException {
    source = Source.values()[in.readByte() & 0xFF];
    id = in.readLong();
}

The source is an enum value which is serialized by writing its ordinal.
 (There are only two possible enum values, hence only one byte.)
The 'compareTo' method of this key type orders by the source values if the
id values are different, otherwise by the id values.
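
One more debugging step I plan to try is a write()/readFields() round trip for
each key type, checking that a key compares equal to its own deserialized copy;
if the round trip is not consistent with compareTo, the sort/merge phase will
happily keep "duplicate" keys apart. A sketch only (Key is the custom type
above; it needs the usual no-arg constructor):

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

public static void roundTripCheck(Key original) throws IOException {
    // Serialize the key exactly as the map output path would.
    DataOutputBuffer out = new DataOutputBuffer();
    original.write(out);

    // Deserialize it back into a fresh instance.
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    Key copy = new Key();
    copy.readFields(in);

    // The copy must be "the same key" according to compareTo.
    if (original.compareTo(copy) != 0) {
        throw new AssertionError(
            "write()/readFields() round trip disagrees with compareTo()");
    }
}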