Re: How to find out what file Hadoop is looking for
This looks like a log dir problem:

  at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:239)

Looking through the source for JobLocalizer, it's trying to create a folder under ${hadoop.log.dir}/userlogs. There's a similar question (I assume it's yours) on StackOverflow:

http://stackoverflow.com/questions/9992566/hadoop-map-reduce-operation-is-failing-on-writing-output

Basically, from your ps trace, hadoop.log.dir is pointing to /home/hadoopmachine/hadoop-1.0.1/libexec/../logs - check that this folder is writable by your 'hadoopmachine' user:

  1000 4249 2.2 0.8 1181992 30176 ? Sl 12:09 0:00 /usr/lib/jvm/java-6-openjdk/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/home/hadoopmachine/hadoop-1.0.1/libexec/../logs -Dhadoop.log.file=hadoop-hadoopmachine-tasktracker-debian.log -Dhadoop.home.dir=/home/hadoopmachine/hadoop-1.0.1/libexec/.. -Dhadoop.id.str=hadoopmachine -Dhadoop.root.logger=INFO,DRFA -Dhadoop.security.logger=INFO,NullAppender -Djava.library.path=/home/hadoopmachine/hadoop-1.0.1/libexec/../lib/native/Linux-i386-32 -Dhadoop.policy.file=hadoop-policy.xml -classpath

On Tue, Apr 3, 2012 at 2:34 PM, Bas Hickendorff hickendorff...@gmail.com wrote:

> ps -ef | grep hadoop shows that it is indeed hadoopmachine that is running
> hadoop. I su'ed into the user hadoopmachine (which is also the standard
> user I log in with in Debian), and I can access the HDFS that way as well.
> The free space should also not be a problem:
>
>   hadoopmachine@debian:~$ df -h
>   Filesystem            Size  Used Avail Use% Mounted on
>   /dev/sda1             7.6G  4.0G  3.2G  56% /
>   tmpfs                 1.8G     0  1.8G   0% /lib/init/rw
>   udev                  1.8G  172K  1.8G   1% /dev
>   tmpfs                 1.8G     0  1.8G   0% /dev/shm
>
> I don't know if it is relevant, but it is on a virtual machine.
>
> Regards, Bas
>
> On Tue, Apr 3, 2012 at 8:17 PM, Harsh J ha...@cloudera.com wrote:
>
>> The permissions look alright if the TT too is run by 'hadoopmachine'.
>> Can you also check if you have adequate space free, as reported by
>> df -h /home/hadoopmachine?
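The check suggested above can be sketched in plain Java. This is a hypothetical helper, not part of Hadoop; the userlogs path and the temp directory used in main are stand-ins for ${hadoop.log.dir} on the actual machine:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Minimal sketch of the check JobLocalizer effectively performs:
// can the task user create and write ${hadoop.log.dir}/userlogs?
public class LogDirCheck {
    static boolean canCreateUserlogs(File logDir) {
        File userlogs = new File(logDir, "userlogs");
        if (userlogs.isDirectory()) {
            return userlogs.canWrite();
        }
        // Directory doesn't exist yet: try to create it, then verify access
        return userlogs.mkdirs() && userlogs.canWrite();
    }

    public static void main(String[] args) throws IOException {
        // On the cluster this would be the real hadoop.log.dir, e.g.
        // /home/hadoopmachine/hadoop-1.0.1/libexec/../logs
        File logDir = Files.createTempDirectory("logdir").toFile();
        System.out.println(canCreateUserlogs(logDir));
    }
}
```

Running this as the same user the TaskTracker runs as (here, 'hadoopmachine') against the real log dir would confirm or rule out the permission theory.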
On Tue, Apr 3, 2012 at 10:28 PM, Bas Hickendorff hickendorff...@gmail.com wrote:

> Thanks for your help! However, as far as I can see, the user has those
> rights. I have in mapred-site.xml:
>
>   <property>
>     <name>mapred.local.dir</name>
>     <value>/home/hadoopmachine/hadoop_data/mapred</value>
>     <final>true</final>
>   </property>
>
> and the directories look like this:
>
>   hadoopmachine@debian:~$ cd /home/hadoopmachine/hadoop_data/mapred
>   hadoopmachine@debian:~/hadoop_data/mapred$ ls -lah
>   total 24K
>   drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 12:11 .
>   drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 08:26 ..
>   drwxr-xr-x  2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 taskTracker
>   drwxr-xr-x  2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 tt_log_tmp
>   drwx------  2 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 ttprivate
>   drwxr-xr-x  2 hadoopmachine hadoopmachine 4.0K Apr  3 08:28 userlogs
>   hadoopmachine@debian:~/hadoop_data/mapred$ cd ..
>   hadoopmachine@debian:~/hadoop_data$ ls -lah
>   total 24K
>   drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 08:26 .
>   drwxr-xr-x 31 hadoopmachine hadoopmachine 4.0K Apr  3 12:08 ..
>   drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 12:10 data
>   drwxr-xr-x  6 hadoopmachine hadoopmachine 4.0K Apr  3 12:11 mapred
>   drwxr-xr-x  5 hadoopmachine hadoopmachine 4.0K Apr  3 12:09 name
>   drwxr-xr-x  4 hadoopmachine hadoopmachine 4.0K Apr  3 10:11 tmp
>
> As far as I can see (but my Linux permissions knowledge might be failing),
> the user hadoopmachine has rights on these folders. I confirmed that that
> user is indeed the user that runs the TaskTracker. Are there any other
> things I could check?
>
> Regards, Bas
>
> On Tue, Apr 3, 2012 at 6:12 PM, Harsh J ha...@cloudera.com wrote:
>
>> Some of your TaskTrackers' mapred.local.dir directories do not have
>> proper r/w permissions set on them. Make sure they are owned by the user
>> that runs the TT service and have read/write permission at least for
>> that user.
On Tue, Apr 3, 2012 at 6:58 PM, Bas Hickendorff hickendorff...@gmail.com wrote:

> Hello all,
>
> My map-reduce operation on Hadoop (running on Debian) is correctly
> starting and finding the input file. However, just after starting the
> map-reduce, Hadoop tells me that it cannot find a file. Unfortunately, it
> does not state what file it cannot find, or where it is looking. Does
> someone know what this file error is? See below for the complete error.
>
> Since the Java error is in the chmod() function (judging from the stack in
> the output), I assume it is a problem with the rights, but how do I know
> what rights to change if it gives me no path?
>
> Thanks in advance, Bas
>
> The output of the job:
>
>   hadoopmachine@debian:~$ ./hadoop-1.0.1/bin/hadoop jar hadooptest/main.jar nl.mydomain.hadoop.debian.test.Main /user/hadoopmachine/input /user/hadoopmachine/output
>   Warning: $HADOOP_HOME is deprecated.
>   12/04/03 08:05:08
Re: how to unit test my RawComparator
When Hadoop is merging spill outputs, or merging map outputs in the reducer, then I can see two byte arrays being used. With regards to pass by reference vs. pass by value, you're right, the byte arrays are passed 'by value', but the value passed is a copy of the reference to the byte array (if that makes sense):

http://www.javaworld.com/javaworld/javaqa/2000-05/03-qa-0526-pass.html

On Sun, Apr 1, 2012 at 1:32 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

> chris,
>
> 1. thanks, that approach to converting my custom key to byte[] works.
>
> 2. on the issue of pass by reference or pass by value (it's been a while
> since i've visited this issue), i'm pretty sure java is pass by value
> (regardless of whether the parameters are primitives or objects). when i
> put the code into the debugger, the ids of byte[] b1 and byte[] b2 are
> equal. if this is indeed the same byte array, why not just pass it as one
> parameter instead of two? unless in some cases, b1 and b2 are not the
> same. this second issue is not terribly important, because the interface
> defines two byte arrays to be passed in, and so there's not much i (we)
> can do about it.
>
> thanks for the help!
>
> On Sat, Mar 31, 2012 at 5:18 PM, Chris White chriswhite...@gmail.com wrote:
>
>> You can serialize your Writables to a ByteArrayOutputStream and then get
>> its underlying byte array:
>>
>>   ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>   DataOutputStream dos = new DataOutputStream(baos);
>>   Writable myWritable = new Text("text");
>>   myWritable.write(dos);
>>   byte[] bytes = baos.toByteArray();
>>
>> I would recommend writing a few bytes to the DataOutputStream first - I
>> always forget to respect the offset variables (s1 / s2), and this,
>> depending on how well you write your unit test, should allow you to test
>> that you are respecting them. The huge byte arrays store the other
>> Writables in the stream that are about to be run by the comparator.
>> Finally, arrays in Java are objects, so you're passing a reference to a
>> byte array, not making a copy of the array.
Chris

On Sat, Mar 31, 2012 at 12:23 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

> i have a RawComparator that i would like to unit test (using the mockito
> and mrunit testing packages). i want to test the method,
>
>   public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
>
> how do i convert my custom key into a byte[] array? is there a util class
> to help me do this?
>
> also, when i put the code into the debugger, i notice that the byte[]
> arrays (b1 and b2) are HUGE (the lengths of each array are huge, in the
> thousands). what is actually in these byte[] arrays? intuitively, it does
> not seem like these byte[] arrays only represent my keys.
>
> lastly, why are such huge byte[] arrays being passed around? one would
> think that since Java is pass-by-value, there would be a large overhead
> with passing such large byte arrays around.
>
> your help is appreciated.
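Putting the advice in this thread together, a raw-comparator unit test can serialize each key after a run of junk bytes, so that a comparator which ignores the s1/s2 offsets fails visibly. A minimal sketch, assuming Hadoop's Text key and its registered raw comparator (substitute your own key and comparator):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;

public class RawComparatorTest {
    // Serialize a Text after `pad` junk bytes, forcing the comparator
    // under test to honour the s1/s2 offset arguments.
    static byte[] serialize(Text key, int pad) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        for (int i = 0; i < pad; i++) {
            dos.writeByte(0xFF); // junk prefix
        }
        key.write(dos);
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        WritableComparator cmp = WritableComparator.get(Text.class);
        byte[] b1 = serialize(new Text("apple"), 3);  // different pads so that
        byte[] b2 = serialize(new Text("banana"), 5); // ignoring s1/s2 fails
        int result = cmp.compare(b1, 3, b1.length - 3, b2, 5, b2.length - 5);
        System.out.println(result < 0); // "apple" sorts before "banana"
    }
}
```

Using two different pad lengths is the key trick: a comparator that silently assumes s1 == s2 == 0 will compare garbage and the assertion will fail.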
Re: what is the code for WritableComparator.readVInt and WritableUtils.decodeVIntSize doing?
A Text object is written out as a vint representing the number of bytes, followed by the byte array contents of the Text object. Because a vint can be between 1 and 5 bytes in length, the decodeVIntSize method examines the first byte of the vint to work out how many bytes to skip over before the text bytes start. readVInt then actually reads the vint bytes to get the length of the following byte array. So when you call the compareBytes method you need to pass in where the actual bytes start (s1 + vIntLen) and how many bytes to compare (vint).

On Mar 31, 2012 12:38 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

> in tom white's book, Hadoop, The Definitive Guide, second edition, on
> page 99, he shows how to compare the raw bytes of a key with Text fields.
> he shows an example like the following:
>
>   int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
>   int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
>
> his explanation is that firstL1 is the length of the first String/Text in
> b1, and firstL2 is the length of the first String/Text in b2. but i'm
> unsure of what the code is actually doing. what is
> WritableUtils.decodeVIntSize(...) doing? what is
> WritableComparator.readVInt(...) doing? why do we have to add the outputs
> of these 2 methods to get the length of the String/Text? could someone
> please explain in plain terms what's happening here?
>
> it seems WritableComparator.readVInt(...) is already getting the length
> of the byte[] corresponding to the string. it seems
> WritableUtils.decodeVIntSize(...) is also doing the same thing (from
> reading the javadoc). when i look at WritableUtils.writeString(...), two
> things happen: the length of the byte[] is written, followed by the
> byte[] itself. why can't we simply do something like the following to get
> the length?
>
>   int firstL1 = readInt(b1[s1]);
>   int firstL2 = readInt(b2[s2]);
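To make the two calls concrete: a small sketch that serializes a Text and then recovers both the size of the length header (decodeVIntSize) and the value stored in it (readVInt) from the raw bytes. Their sum is the total span of the Text in the buffer, which is why the book adds them:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class VIntDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        new Text("hello").write(dos); // vint length header + UTF-8 bytes
        byte[] b = baos.toByteArray();

        // How many bytes the length header itself occupies (1 for "hello"):
        int headerSize = WritableUtils.decodeVIntSize(b[0]);
        // The value stored in the header, i.e. the string's byte length (5):
        int strLen = WritableUtils.readVInt(new java.io.DataInputStream(
                new java.io.ByteArrayInputStream(b)));

        // The text bytes occupy b[headerSize .. headerSize + strLen), so the
        // whole Text spans headerSize + strLen bytes of the buffer.
        System.out.println(headerSize + " " + strLen + " " + b.length);
    }
}
```

So readVInt alone gives only the string length; without decodeVIntSize you would not know where the string bytes begin, nor how far to advance to the next field.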
Re: how to unit test my RawComparator
You can serialize your Writables to a ByteArrayOutputStream and then get its underlying byte array:

  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  DataOutputStream dos = new DataOutputStream(baos);
  Writable myWritable = new Text("text");
  myWritable.write(dos);
  byte[] bytes = baos.toByteArray();

I would recommend writing a few bytes to the DataOutputStream first - I always forget to respect the offset variables (s1 / s2), and this, depending on how well you write your unit test, should allow you to test that you are respecting them. The huge byte arrays store the other Writables in the stream that are about to be run by the comparator. Finally, arrays in Java are objects, so you're passing a reference to a byte array, not making a copy of the array.

Chris

On Sat, Mar 31, 2012 at 12:23 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

> i have a RawComparator that i would like to unit test (using the mockito
> and mrunit testing packages). i want to test the method,
>
>   public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
>
> how do i convert my custom key into a byte[] array? is there a util class
> to help me do this?
>
> also, when i put the code into the debugger, i notice that the byte[]
> arrays (b1 and b2) are HUGE (the lengths of each array are huge, in the
> thousands). what is actually in these byte[] arrays? intuitively, it does
> not seem like these byte[] arrays only represent my keys.
>
> lastly, why are such huge byte[] arrays being passed around? one would
> think that since Java is pass-by-value, there would be a large overhead
> with passing such large byte arrays around.
>
> your help is appreciated.
Re: how to unit test my RawComparator
BytesWritable writes the size of the byte array as an int (4 bytes) and then the contents of the byte array ("test".getBytes().length == 4), so 8 bytes in total.

On Sat, Mar 31, 2012 at 6:50 PM, Tom Melendez t...@supertom.com wrote:

> Hi Chris and all,
>
> hope you don't mind if I inject a question in here. It's highly related
> IMO (famous last words).
>
> On Sat, Mar 31, 2012 at 2:18 PM, Chris White chriswhite...@gmail.com wrote:
>
>> You can serialize your Writables to a ByteArrayOutputStream and then get
>> its underlying byte array:
>>
>>   ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>   DataOutputStream dos = new DataOutputStream(baos);
>>   Writable myWritable = new Text("text");
>>   myWritable.write(dos);
>>   byte[] bytes = baos.toByteArray();
>
> I popped this into a quick test and it failed. What I want are the exact
> bytes back from the Writable (in my case, BytesWritable). So, this fails
> for me:
>
>   @Test
>   public void byteswritabletest() {
>     ByteArrayOutputStream baos = new ByteArrayOutputStream();
>     DataOutputStream dos = new DataOutputStream(baos);
>     BytesWritable myBW = new BytesWritable("test".getBytes());
>     try {
>       myBW.write(dos);
>     } catch (IOException e) {
>       e.printStackTrace();
>     }
>     byte[] bytes = baos.toByteArray();
>     // I get expected: 4, actual: 8 with this assertion
>     assertEquals("test".getBytes().length, bytes.length);
>   }
>
> I see that in newer versions of Text and BytesWritable, there is a
> .copyBytes() method available that gives us exactly that:
>
> https://reviews.apache.org/r/182/diff/
>
> Is there another way (without the upgrade) to achieve that?
>
> Thanks, Tom
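One way to get the exact bytes without upgrading: BytesWritable.getBytes() returns the (possibly padded) backing array and getLength() reports how much of it is valid, so copying that range reproduces what the newer copyBytes() does. A sketch (the helper name copyBytes here is our own, not the library's):

```java
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;

public class ExactBytes {
    // Equivalent of the newer copyBytes(): trim the padded backing array
    // down to the valid region reported by getLength().
    static byte[] copyBytes(BytesWritable bw) {
        return Arrays.copyOfRange(bw.getBytes(), 0, bw.getLength());
    }

    public static void main(String[] args) {
        BytesWritable bw = new BytesWritable("test".getBytes());
        System.out.println(copyBytes(bw).length); // 4, not 8
    }
}
```

Note this inspects the in-memory object rather than its serialized form, so the 4-byte length header that write() emits never appears.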
Re: Partition classes, how to pass in background information
If your class implements the Configurable interface, Hadoop will call the setConf method after creating the instance. Look in the source code of ReflectionUtils.newInstance for more info.

On Mar 14, 2012 2:31 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

> i am using the new org.apache.hadoop.mapreduce.Partitioner class. however,
> i need to pass it some background information. how can i do this?
>
> in the old org.apache.hadoop.mapred.Partitioner class (now deprecated),
> the class extends JobConfigurable, and it seems the hook to pass in any
> background data is the JobConfigurable.configure(JobConf job) method.
>
> i thought that if i sub-classed org.apache.hadoop.mapreduce.Partitioner,
> i could pass in the background information; however, the
> org.apache.hadoop.mapreduce.Job class only has a
> setPartitionerClass(Class<? extends Partitioner>) method. all my
> development has been in the new mapreduce package, and i would definitely
> prefer to stick with the new API/package.
>
> any help is appreciated.
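A minimal sketch of that suggestion with the new API. The property name "my.partitioner.modulus" and the Text key/value types are hypothetical stand-ins for whatever background information you need to pass:

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConfiguredPartitioner extends Partitioner<Text, Text>
        implements Configurable {

    private Configuration conf;
    private int modulus;

    @Override
    public void setConf(Configuration conf) {
        // Called by ReflectionUtils.newInstance right after construction,
        // so the "background information" can be read from the job config.
        this.conf = conf;
        this.modulus = conf.getInt("my.partitioner.modulus", 1); // hypothetical key
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return Math.abs(key.hashCode() % modulus) % numPartitions;
    }
}
```

On the driver side you would set the value before submitting, e.g. job.getConfiguration().setInt("my.partitioner.modulus", 7), and register the class with job.setPartitionerClass(ConfiguredPartitioner.class).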
Re: Hadoop in Action Partitioner Example
Job.setGroupingComparatorClass will allow you to define a RawComparator class in which you compare only the K1 component of K. The reduce-side sort will still sort all K's using the compareTo method of K, but it will use the grouping comparator when deciding which values to pass to a single call of the reduce method.

On Tue, Aug 23, 2011 at 7:25 PM, Mehmet Tepedelenlioglu mehmets...@gmail.com wrote:

> For those of you who have the book: on page 49 there is a custom
> partitioner example. It basically describes a situation where the map
> emits (K,V), but the key is a compound key like (K1,K2), and we want to
> reduce over the K1s and not the whole of the Ks. This is used as an
> example of a situation where a custom partitioner should be written to
> hash over K1 to send the right keys to the same reducers. But as far as I
> know, although this would partition the keys correctly (send them to the
> correct reducers), the reduce function would still be called with
> (grouped under) the original keys K, not yielding the desired results.
> The only way of doing this that I know of is to create a new
> WritableComparable that carries all of K but only uses K1 for its
> hash/equals/compare methods, in which case you would not need to write
> your own partitioner anyway. Am I misinterpreting something the author
> meant, or is there something I don't know going on? It would have been
> sweet if I could accomplish all that with just the partitioner. Either I
> am misunderstanding something fundamental, or I am misunderstanding the
> example's intention, or there is something wrong with it.
>
> Thanks, Mehmet
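A sketch of the pattern described above. CompoundKey is a hypothetical (K1,K2) key; its full compareTo drives the sort, while the grouping comparator looks only at K1:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical compound key (K1, K2).
class CompoundKey implements WritableComparable<CompoundKey> {
    Text first = new Text();   // K1
    Text second = new Text();  // K2

    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    // Full comparison: the reduce-side sort still orders on (K1, K2).
    public int compareTo(CompoundKey o) {
        int c = first.compareTo(o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }
}

// Groups reduce input on K1 only; K2 still orders values within a group.
class FirstKeyGroupingComparator extends WritableComparator {
    public FirstKeyGroupingComparator() {
        super(CompoundKey.class, true); // true => instantiate keys for compare()
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompoundKey) a).first.compareTo(((CompoundKey) b).first);
    }
}
```

The driver would register it with job.setGroupingComparatorClass(FirstKeyGroupingComparator.class), alongside a partitioner that hashes K1 so equal K1 keys reach the same reducer.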
Re: WritableComparable
Are you using a hash partitioner? If so, make sure the hash value of the Writable is not calculated using the hashCode value of the enum - use the ordinal value instead. The hashCode value of an enum is different for each JVM.
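A pure-Java sketch of the difference. Enum.hashCode() is identity-based and varies between JVM runs, so a HashPartitioner (which partitions on key.hashCode()) can send "identical" keys emitted by different JVMs to different reducers; ordinal() is fixed by declaration order. The key class and enum here are illustrative:

```java
public class EnumHashKey {
    enum Status { ACTIVE, INACTIVE, DELETED }

    private final int id;
    private final Status status;

    EnumHashKey(int id, Status status) {
        this.id = id;
        this.status = status;
    }

    // BAD (unstable across JVMs):  31 * id + status.hashCode()
    // GOOD (stable everywhere):    use the ordinal
    @Override
    public int hashCode() {
        return 31 * id + status.ordinal();
    }

    public static void main(String[] args) {
        EnumHashKey k = new EnumHashKey(2, Status.DELETED);
        System.out.println(k.hashCode()); // 31 * 2 + 2 = 64 on every JVM
    }
}
```

The same caution applies to name().hashCode() only if the enum names change; ordinal() breaks instead if constants are reordered, so pick whichever invariant your job can guarantee.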
Re: WritableComparable
Can you copy the contents of your parent Writable's readFields and write methods (not the ones you've already posted)?

Another thing you could try: if you know you have two identical keys, write a unit test to examine the result of compareTo for two instances to confirm the correct behavior (even going as far as serializing and deserializing before the comparison).

Finally, just to confirm: you don't have any group or order comparators registered?
MultipleOutputs and new 20.1 API
Does the code for MultipleOutputs work (as described in the Javadocs) with the new 20.1 API? This class expects a JobConf in most of its methods, which is deprecated in 20.1. If not, are there any plans to update MultipleOutputs to work with the new API?
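For reference, later releases ship a new-API port at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, which takes the mapreduce Context rather than a JobConf. A sketch of its usage in case an upgrade is an option; the named output "errors", the ERR prefix convention, and the Text types are illustrative assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ErrorSplittingReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (value.toString().startsWith("ERR")) {
                // Route to the named output declared on the driver
                mos.write("errors", key, value);
            } else {
                context.write(key, value); // normal output path
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush the side outputs
    }
}
```

The driver declares the named output before submission, e.g. MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, Text.class).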