Re: How to find out what file Hadoop is looking for

2012-04-03 Thread Chris White
This looks like a log dir problem:

 at 
org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:239)

looking through the source for JobLocalizer, it's trying to create a
folder under ${hadoop.log.dir}/userlogs. There's a similar question (i
assume it's yours) on StackOverflow:

http://stackoverflow.com/questions/9992566/hadoop-map-reduce-operation-is-failing-on-writing-output

Basically, from your ps trace, hadoop.log.dir is pointing to
/home/hadoopmachine/hadoop-1.0.1/libexec/../logs - check that this folder
is writable by your 'hadoopmachine' user:

1000  4249  2.2  0.8 1181992 30176 ?   Sl   12:09   0:00
/usr/lib/jvm/java-6-openjdk/bin/java -Dproc_tasktracker -Xmx1000m
-Dhadoop.log.dir=/home/hadoopmachine/hadoop-1.0.1/libexec/../logs
-Dhadoop.log.file=hadoop-hadoopmachine-tasktracker-debian.log
-Dhadoop.home.dir=/home/hadoopmachine/hadoop-1.0.1/libexec/..
-Dhadoop.id.str=hadoopmachine -Dhadoop.root.logger=INFO,DRFA
-Dhadoop.security.logger=INFO,NullAppender
-Djava.library.path=/home/hadoopmachine/hadoop-1.0.1/libexec/../lib/native/Linux-i386-32
-Dhadoop.policy.file=hadoop-policy.xml -classpath
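
As a quick sanity check, something like the following (a hypothetical snippet,
not from the thread; the path is the hadoop.log.dir value taken from the ps
output above) can be run as the 'hadoopmachine' user to see whether the
userlogs directory can be created and written:

import java.io.File;

public class LogDirCheck {
    public static void main(String[] args) {
        // hadoop.log.dir as reported by the TaskTracker's ps entry above
        File logDir = new File("/home/hadoopmachine/hadoop-1.0.1/libexec/../logs");
        File userlogs = new File(logDir, "userlogs");
        System.out.println("log dir writable:  " + logDir.canWrite());
        System.out.println("userlogs exists:   " + userlogs.exists());
        System.out.println("userlogs writable: " + userlogs.canWrite());
        // JobLocalizer.initializeJobLogDir() needs to create job directories under userlogs
        if (!userlogs.exists() && !userlogs.mkdirs()) {
            System.out.println("could not create " + userlogs.getAbsolutePath());
        }
    }
}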



On Tue, Apr 3, 2012 at 2:34 PM, Bas Hickendorff
hickendorff...@gmail.com wrote:
 ps -ef | grep hadoop shows that it is indeed hadoopmachine that is
 running hadoop.

 I su'ed into the user hadoopmachine (which is also the standard user I
 log in with in Debian), and I can access the HDFS that way as well.

 The free space should also not be a problem:

 hadoopmachine@debian:~$ df -h
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/sda1             7.6G  4.0G  3.2G  56% /
 tmpfs                 1.8G     0  1.8G   0% /lib/init/rw
 udev                  1.8G  172K  1.8G   1% /dev
 tmpfs                 1.8G     0  1.8G   0% /dev/shm


 I don't know if it is relevant, but it is on a virtual machine.


 Regards,

 Bas


 On Tue, Apr 3, 2012 at 8:17 PM, Harsh J ha...@cloudera.com wrote:
 The permissions look alright if TT too is run by 'hadoopmachine'. Can
 you also check if you have adequate space free, reported by df -h
 /home/hadoopmachine?

 On Tue, Apr 3, 2012 at 10:28 PM, Bas Hickendorff
 hickendorff...@gmail.com wrote:
 Thanks for your help!
 However, as far as I can see, the user has those rights.

 I have in mapred-site.xml:

   <property>
      <name>mapred.local.dir</name>
      <value>/home/hadoopmachine/hadoop_data/mapred</value>
      <final>true</final>
   </property>


 and the directories look like this:

 hadoopmachine@debian:~$ cd /home/hadoopmachine/hadoop_data/mapred
 hadoopmachine@debian:~/hadoop_data/mapred$ ls -lah
 total 24K
 drwxr-xr-x 6 hadoopmachine hadoopmachine 4.0K Apr  3 12:11 .
 drwxr-xr-x 6 hadoopmachine hadoopmachine 4.0K Apr  3 08:26 ..
 drwxr-xr-x 2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 taskTracker
 drwxr-xr-x 2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 tt_log_tmp
 drwx------ 2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 ttprivate
 drwxr-xr-x 2 hadoopmachine hadoopmachine 4.0K Apr  3 08:28 userlogs

 hadoopmachine@debian:~/hadoop_data/mapred$ cd ..
 hadoopmachine@debian:~/hadoop_data$ ls -lah
 total 24K
 drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 08:26 .
 drwxr-xr-x 31 hadoopmachine hadoopmachine 4.0K Apr  3 12:08 ..
 drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 data
 drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 12:11 mapred
 drwxr-xr-x  5 hadoopmachine hadoopmachine 4.0K Apr  3 12:09 name
 drwxr-xr-x  4 hadoopmachine hadoopmachine 4.0K Apr  3 10:11 tmp


 As far as I can see (but my linux permissions knowledge might be
 failing) the user hadoopmachine has rights on these folders. I
 confirmed that that user is indeed the user that runs the TaskTracker.

 Are there any other things I could check?


 Regards,

 Bas

 On Tue, Apr 3, 2012 at 6:12 PM, Harsh J ha...@cloudera.com wrote:
 Some of your TaskTrackers' mapred.local.dirs do not have proper r/w
 permissions set on them. Make sure they are owned by the user that
 runs the TT service and have read/write permission at least for that
 user.

 On Tue, Apr 3, 2012 at 6:58 PM, Bas Hickendorff
 hickendorff...@gmail.com wrote:
 Hello all,

 My map-reduce operation on Hadoop (running on Debian) is correctly
 starting and finding the input file. However, just after starting the
 map reduce, Hadoop tells me that it cannot find a file. Unfortunately,
 it does not state what file it cannot find, or where it is looking.
 Does someone know what this file error is about? See below for the complete
 error.

 Since the java error is in the chmod() function (judging from the
 stack in the output), I assume it is a problem with the rights, but
 how do I know what rights to change if it gives me no path?

 Thanks in advance,

 Bas




 The output of the job:


 hadoopmachine@debian:~$ ./hadoop-1.0.1/bin/hadoop jar
 hadooptest/main.jar nl.mydomain.hadoop.debian.test.Main
 /user/hadoopmachine/input /user/hadoopmachine/output
 Warning: $HADOOP_HOME is deprecated.

 12/04/03 08:05:08 

Re: how to unit test my RawComparator

2012-04-01 Thread Chris White
When Hadoop is merging spill outputs, or merging map outputs in the
reducer, I can see two distinct byte arrays being used.

With regard to pass by reference vs. pass by value, you're right: the byte
arrays are passed 'by value', but the value passed is a copy of the
reference to the byte array (if that makes sense).

http://www.javaworld.com/javaworld/javaqa/2000-05/03-qa-0526-pass.html


On Sun, Apr 1, 2012 at 1:32 AM, Jane Wayne jane.wayne2...@gmail.com wrote:
 chris,

 1. thanks, that approach to converting my custom key to byte[] works.

 2. on the issue of pass by reference or pass by value, (it's been a while
 since i've visited this issue), i'm pretty sure java is pass by value
 (regardless of whether the parameters are primitives or objects). when i
 put the code into debugger, the ids of byte[] b1 and byte[] b2 are equal.
 if this is indeed the same byte array, why not just pass it as one
 parameter instead of two? unless in some cases, b1 and b2 are not the same.
 this second issue is not terribly important, because the interface
 defines two byte arrays to be passed in, and so there's not much i (we) can
 do about it.

 thanks for the help!

 On Sat, Mar 31, 2012 at 5:18 PM, Chris White chriswhite...@gmail.com wrote:

 You can serialize your Writables to a ByteArrayOutputStream and then
 get it's underlying byte array:

 ByteArrayOutputStream baos = new ByteArrayOutputStream();
 DataOutputStream dos = new DataOutputStream(baos);
 Writable myWritable = new Text("text");
 myWritable.write(dos);
 byte[] bytes = baos.toByteArray();

 I would recommend writing a few bytes to the DataOutputStream first -
 i always forget to respect the offset variables (s1 / s2), and this,
 depending on how well you write your unit test, should allow you to
 test that you are respecting them.

 The huge byte arrays store the other Writables in the stream that are
 about to be run through the comparator.

 Finally, arrays in java are objects, so you're passing a reference to
 a byte array, not making a copy of the array.

 Chris

 On Sat, Mar 31, 2012 at 12:23 AM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
  i have a RawComparator that i would like to unit test (using mockito and
  mrunit testing packages). i want to test the method,
 
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
 
  how do i convert my custom key into a byte[] array? is there a util class
  to help me do this?
 
  also, when i put the code into the debugger, i notice that the byte[]
  arrays (b1 and b2) are HUGE (the lengths of each array are huge, in the
  thousands). what is actually in these byte[] arrays? intuitively, it does
  not seem like these byte[] arrays only represent my keys.
 
  lastly, why are such huge byte[] arrays being passed around? one would
  think that since Java is pass-by-value, there would be a large overhead
  with passing such large byte arrays around.
 
  your help is appreciated.



Re: what is the code for WritableComparator.readVInt and WritableUtils.decodeVIntSize doing?

2012-03-31 Thread Chris White
A Text object is written out as a vint representing the number of bytes,
followed by the byte array contents of the Text object.

Because a vint can be between 1 and 5 bytes in length, the decodeVIntSize
method examines the first byte of the vint to work out how many bytes to
skip over before the Text bytes start.

readVInt then actually reads the vint bytes to get the length of the
following byte array.

So when you call the compareBytes method, you need to pass in where the
actual bytes start (s1 + vIntLen) and how many bytes to compare (the vint value).
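
For illustration, here is a minimal sketch of a raw comparator built along
those lines. It assumes each serialized key begins with a single
vint-prefixed Text field; the class name is made up for the example.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class FirstTextRawComparator extends WritableComparator {

    public FirstTextRawComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            int vIntLen1 = WritableUtils.decodeVIntSize(b1[s1]);  // size of the length prefix itself (1-5 bytes)
            int vIntLen2 = WritableUtils.decodeVIntSize(b2[s2]);
            int textLen1 = readVInt(b1, s1);                      // value of the prefix = length of the Text bytes
            int textLen2 = readVInt(b2, s2);
            // Compare only the Text contents: start just past the prefix, compare textLen bytes.
            return compareBytes(b1, s1 + vIntLen1, textLen1,
                                b2, s2 + vIntLen2, textLen2);
        } catch (IOException e) {
            throw new IllegalArgumentException("could not decode vint length prefix", e);
        }
    }
}
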
On Mar 31, 2012 12:38 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

 in tom white's book, Hadoop, The Definitive Guide, in the second edition,
 on page 99, he shows how to compare the raw bytes of a key with Text
 fields. he shows an example like the following.

 int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
 int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

 his explanation is that firstL1 is the length of the first String/Text in
 b1, and firstL2 is the length of the first String/Text in b2. but i'm
 unsure of what the code is actually doing.

 what is WritableUtils.decodeVIntSize(...) doing?
 what is WritableComparator.readVInt(...) doing?
 why do we have to add the outputs of these 2 methods to get the length of
 the String/Text?

 could someone please explain in plain terms what's happening here? it seems
 WritableComparator.readVInt(...) is already getting the length of the
 byte[] corresponding to the string. it seems
 WritableUtils.decodeVIntSize(...) is also doing the same thing (from
 reading the javadoc).

 when i look at WritableUtils.writeString(...), two things happen. the
 length of the byte[] is written, followed by writing the byte[] itself. why
 can't we simply do something like the following to get the length?

 int firstL1 = readInt(b1[s1]);
 int firstL2 = readInt(b2[s2]);



Re: how to unit test my RawComparator

2012-03-31 Thread Chris White
You can serialize your Writables to a ByteArrayOutputStream and then
get it's underlying byte array:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
Writable myWritable = new Text("text");
myWritable.write(dos);
byte[] bytes = baos.toByteArray();

I would recommend writing a few bytes to the DataOutputStream first -
i always forget to respect the offset variables (s1 / s2), and this,
depending on how well you write your unit test, should allow you to
test that you are respecting them.
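
A minimal sketch of that offset test might look like the following. The
padding sizes are arbitrary, and Text's registered raw comparator stands in
for the comparator under test:

import static org.junit.Assert.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.junit.Test;

public class RawComparatorOffsetTest {

    // Write some junk bytes before the key so the comparator is forced to
    // honour a non-zero start offset.
    private byte[] serializeWithPadding(Text key, int padding) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        for (int i = 0; i < padding; i++) {
            dos.writeByte(0xFF);
        }
        key.write(dos);
        return baos.toByteArray();
    }

    @Test
    public void compareRespectsOffsets() throws IOException {
        byte[] b1 = serializeWithPadding(new Text("apple"), 3);
        byte[] b2 = serializeWithPadding(new Text("apple"), 7);

        // Swap in your own RawComparator here; identical keys must compare as 0
        // even though they start at different offsets.
        WritableComparator comparator = WritableComparator.get(Text.class);
        assertEquals(0, comparator.compare(b1, 3, b1.length - 3, b2, 7, b2.length - 7));
    }
}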

The huge byte arrays store the other Writables in the stream that are
about to be run through the comparator.

Finally, arrays in java are objects, so you're passing a reference to
a byte array, not making a copy of the array.

Chris

On Sat, Mar 31, 2012 at 12:23 AM, Jane Wayne jane.wayne2...@gmail.com wrote:
 i have a RawComparator that i would like to unit test (using mockito and
 mrunit testing packages). i want to test the method,

 public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)

 how do i convert my custom key into a byte[] array? is there a util class
 to help me do this?

 also, when i put the code into the debugger, i notice that the byte[]
 arrays (b1 and b2) are HUGE (the lengths of each array are huge, in the
 thousands). what is actually in these byte[] arrays? intuitively, it does
 not seem like these byte[] arrays only represent my keys.

 lastly, why are such huge byte[] arrays being passed around? one would
 think that since Java is pass-by-value, there would be a large overhead
 with passing such large byte arrays around.

 your help is appreciated.


Re: how to unit test my RawComparator

2012-03-31 Thread Chris White
BytesWritable writes the size of the byte array as an int (4 bytes) and
then the contents of the byte array ("test".getBytes() is 4 bytes), so
8 bytes in total.
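
To make that layout concrete, here is a small sketch (not from the thread)
showing the 4-byte int length prefix in front of the 4 payload bytes, and how
skipping the prefix recovers the original bytes without needing copyBytes():

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableLayout {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        new BytesWritable("test".getBytes()).write(dos);

        byte[] serialized = baos.toByteArray();
        System.out.println(serialized.length);            // 8 = 4-byte int length + 4 payload bytes

        // Skipping the 4-byte prefix recovers the original payload.
        byte[] payload = Arrays.copyOfRange(serialized, 4, serialized.length);
        System.out.println(new String(payload));          // test
    }
}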

On Sat, Mar 31, 2012 at 6:50 PM, Tom Melendez t...@supertom.com wrote:
 Hi Chris and all, hope you don't mind if I inject a question in here.
 It's highly related IMO (famous last words).

 On Sat, Mar 31, 2012 at 2:18 PM, Chris White chriswhite...@gmail.com wrote:
 You can serialize your Writables to a ByteArrayOutputStream and then
 get it's underlying byte array:

 ByteArrayOutputStream baos = new ByteArrayOutputStream();
 DataOutputStream dos = new DataOutputStream(baos);
 Writable myWritable = new Text("text");
 myWritable.write(dos);
 byte[] bytes = baos.toByteArray();


 I popped this into a quick test and it failed.  What I want are the
 exact bytes back from the Writable (in my case, BytesWritable).  So,
 this fails for me:

        @Test
        public void byteswritabletest() {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                DataOutputStream dos = new DataOutputStream(baos);
                 BytesWritable myBW = new BytesWritable("test".getBytes());
                try {
                        myBW.write(dos);
                } catch (IOException e) {
                        e.printStackTrace();
                }
                byte[] bytes = baos.toByteArray();
                 assertEquals("test".getBytes().length, bytes.length);  // I get
 expected: 4, actual 8 with this assertion
        }


 I see that in new versions of Text and BytesWritable, there is a
 .copyBytes() method available that gives us that.
 https://reviews.apache.org/r/182/diff/

 Is there another way (without the upgrade) to achieve that?

 Thanks,

 Tom


Re: Partition classes, how to pass in background information

2012-03-14 Thread Chris White
If your class implements the Configurable interface, Hadoop will call the
setConf method after creating the instance. Look at the source code for
ReflectionUtils.newInstance for more info.
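
As an illustration, here is a minimal sketch of a new-API partitioner that
picks up its background information this way. The property name, key/value
types, and partitioning rule are assumptions, not from the thread:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BackgroundInfoPartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

    private Configuration conf;
    private String specialPrefix;   // the "background information" this partitioner needs

    @Override
    public void setConf(Configuration conf) {
        // Called by ReflectionUtils.newInstance right after the instance is created.
        this.conf = conf;
        // "my.partitioner.special.prefix" is an assumed property name, set by the
        // driver with job.getConfiguration().set(...) before submission.
        this.specialPrefix = conf.get("my.partitioner.special.prefix", "");
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Trivial example: route keys carrying the configured prefix to partition 0.
        if (!specialPrefix.isEmpty() && key.toString().startsWith(specialPrefix)) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
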
On Mar 14, 2012 2:31 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

 i am using the new org.apache.hadoop.mapreduce.Partitioner class. however,
 i need to pass it some background information. how can i do this?

 in the old, org.apache.hadoop.mapred.Partitioner class (now deprecated),
 this class extends JobConfigurable, and it seems the hook to pass in any
 background data is with the JobConfigurable.configure(JobConf job) method.

 i thought that if i sub-classed org.apache.hadoop.mapreduce.Partitioner, i
 could pass in the background information, however, in the
 org.apache.hadoop.mapreduce.Job class, it only has a
 setPartitionerClass(Class<? extends Partitioner>) method.

 all my development has been in the new mapreduce package, and i would
 definitely desire to stick with the new API/package. any help is
 appreciated.



Re: Hadoop in Action Partitioner Example

2011-08-23 Thread Chris White
Job.setGroupingComparatorClass will allow you to define a
RawComparator class in which you compare only the K1 component of
K. The reduce-side sort will still sort all Ks using the compareTo method
of K, but will use the grouping comparator when deciding which values
to pass to the reduce method.
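
A minimal sketch of such a grouping comparator follows. CompoundKey and its
getFirst() accessor for the K1 part are hypothetical stand-ins for the book's
compound key:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstPartGroupingComparator extends WritableComparator {

    protected FirstPartGroupingComparator() {
        // true: let the base class deserialize instances so compare(a, b) below is used.
        super(CompoundKey.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group only on the K1 component; the full (K1, K2) ordering still comes
        // from CompoundKey.compareTo, so values arrive sorted within each group.
        CompoundKey left = (CompoundKey) a;
        CompoundKey right = (CompoundKey) b;
        return left.getFirst().compareTo(right.getFirst());
    }
}

// Driver side:
// job.setGroupingComparatorClass(FirstPartGroupingComparator.class);
// (a partitioner that hashes only K1 is still needed, as described in the book)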

On Tue, Aug 23, 2011 at 7:25 PM, Mehmet Tepedelenlioglu
mehmets...@gmail.com wrote:
 For those of you who have the book, on page 49 there is a custom partitioner 
 example. It basically describes a situation where the map emits K,V, but 
 the key is a compound key like (K1,K2), and we want to reduce over K1s and 
 not the whole of the Ks. This is used as an example of a situation where a 
 custom partitioner should be written to hash over K1 to send the right keys 
 to the same reducers. But as far as I know, although this would partition the 
 keys correctly (send them to the correct reducers), the reduce function would 
 still be called (grouped under) with the original keys K, not yielding the 
 desired results. The only way of doing this that I know of is to create a new 
 WritableComparable, that carries all of K, but only uses K1 for 
 hash/equal/compare methods, in which case you would not need to write your 
 own partitioner anyway. Am I misinterpreting something the author meant, or 
 is there something I don't know going on? It would have been sweet if I could 
 accomplish all that with just the partitioner. Either I am misunderstanding 
 something fundamental, or I am misunderstanding the example's intention, or 
 there is something wrong with it.

 Thanks,

 Mehmet




Re: WritableComparable

2011-08-16 Thread Chris White
Are you using a hash partitioner? If so, make sure the hash value of the
Writable is not calculated using the hashCode value of the enum - use the
ordinal value instead. The hashCode value of an enum can be different in
each JVM.
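
For example, a sketch of the hashCode of a hypothetical key class with an
enum field (the Writable plumbing is omitted for brevity):

import org.apache.hadoop.io.Text;

// Hypothetical key class; write()/readFields()/compareTo() are omitted.
public class EventKey {
    public enum Type { CREATE, UPDATE, DELETE }

    private final Text id;
    private final Type type;

    public EventKey(String id, Type type) {
        this.id = new Text(id);
        this.type = type;
    }

    @Override
    public int hashCode() {
        // type.ordinal() is stable across JVMs; type.hashCode() is an identity-based
        // value that can differ between JVMs, so HashPartitioner would scatter equal keys.
        return id.hashCode() * 163 + type.ordinal();
    }
}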


Re: WritableComparable

2011-08-16 Thread Chris White
Can you copy the contents of your parent Writable readFields and write
methods (not the ones you've already posted)?

Another thing you could try: if you know you have two identical keys, can
you write a unit test to examine the result of compareTo for two instances
to confirm the correct behavior (even going as far as serializing and
deserializing before the comparison)?

Finally, just to confirm, you don't have any grouping or ordering comparators
registered?
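
A minimal sketch of the serialize/deserialize test suggested above (MyKey,
its no-arg and String constructors, and its compareTo are hypothetical):

import static org.junit.Assert.assertEquals;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.junit.Test;

public class MyKeyCompareToTest {

    // Serialize and immediately deserialize a key, as happens on the reduce side.
    private MyKey roundTrip(MyKey key) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        key.write(new DataOutputStream(baos));
        MyKey copy = new MyKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(baos.toByteArray())));
        return copy;
    }

    @Test
    public void identicalKeysStillCompareEqualAfterRoundTrip() throws IOException {
        MyKey a = new MyKey("same-value");
        MyKey b = new MyKey("same-value");
        assertEquals(0, roundTrip(a).compareTo(roundTrip(b)));
    }
}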


MultipleOutputs and new 20.1 API

2010-02-23 Thread Chris White
Does the code for MultipleOutputs work (as described in the Javadocs) with 
the new 20.1 API? This class expects a JobConf for most of its methods, 
which is deprecated in 20.1. If not, are there any plans to update 
MultipleOutputs to work with the new API?