Re: stable version
Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: Reporter for Hadoop Streaming?
You can retrieve them from the command line using bin/hadoop job -counter job-id group-name counter-name Tom On Wed, Feb 11, 2009 at 12:20 AM, scruffy323 steve.mo...@gmail.com wrote: Do you know how to access those counters programmatically after the job has run? S D-5 wrote: This does it. Thanks! On Thu, Feb 5, 2009 at 9:14 PM, Arun C Murthy a...@yahoo-inc.com wrote: On Feb 5, 2009, at 1:40 PM, S D wrote: Is there a way to use the Reporter interface (or something similar such as Counters) with Hadoop streaming? Alternatively, how could STDOUT be intercepted for the purpose of updates? If anyone could point me to documentation or examples that cover this I'd appreciate it. http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+counters+in+streaming+applications%3F http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+status+in+streaming+applications%3F Arun -- View this message in context: http://www.nabble.com/Reporter-for-Hadoop-Streaming--tp21861786p21945843.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
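[Editor's note: on scruffy323's question, the same counters the CLI prints can also be read from Java after the job has run. A hedged sketch against the 0.18/0.19-era org.apache.hadoop.mapred API; the class name and argument handling are illustrative:]

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PrintCounter {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // Look the job up by its id, then read the same counter the CLI shows:
        // args = job-id group-name counter-name, e.g. job_200902110011_0001 ...
        RunningJob job = client.getJob(args[0]);
        Counters counters = job.getCounters();
        long value = counters.getGroup(args[1]).getCounter(args[2]);
        System.out.println(value);
      }
    }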
Re: anybody knows an apache-license-compatible impl of Integer.parseInt?
Zheng Shao wrote: We need to implement a version of Integer.parseInt/atoi that works on byte[] instead of String, to avoid the high cost of creating a String object. I wanted to take the OpenJDK code but the license is GPL: http://www.docjar.com/html/api/java/lang/Integer.java.html Does anybody know an implementation that I can use for Hive (Apache license)? I also need to do it for Byte, Short, Long, and Double. I just don't want to go over all the corner cases. Use the Apache Harmony code http://svn.apache.org/viewvc/harmony/enhanced/classlib/branches/java6/modules/
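[Editor's note: for reference, the digit loop itself is short; the corner cases Zheng mentions (overflow, Integer.MIN_VALUE, malformed input) are exactly what makes borrowing vetted code attractive. A minimal sketch, not the Harmony implementation, parsing a decimal int straight from a byte[] slice with no String allocation:]

    // Hedged sketch: parse a decimal int from bytes[start, start+length).
    // Deliberately omits overflow checks (e.g. Integer.MIN_VALUE), which the
    // Harmony/JDK versions handle and a production version must too.
    public static int parseInt(byte[] bytes, int start, int length) {
      if (length <= 0) throw new NumberFormatException("empty input");
      int i = start, end = start + length;
      boolean negative = bytes[i] == '-';
      if (negative || bytes[i] == '+') i++;
      if (i == end) throw new NumberFormatException("no digits");
      int result = 0;
      for (; i < end; i++) {
        int digit = bytes[i] - '0';
        if (digit < 0 || digit > 9)
          throw new NumberFormatException("bad digit at offset " + i);
        result = result * 10 + digit;
      }
      return negative ? -result : result;
    }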
Re: File Transfer Rates
Brian Bockelman wrote: Just to toss out some numbers (and because our users are making interesting numbers right now). Here's our external network router: http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets Here's the application-level transfer graph: http://t2.unl.edu/phedex/graphs/quantity_rates?link=srcno_mss=trueto_node=Nebraska In a squeeze, we can move 20-50TB / day to/from other heterogeneous sites. Usually, we run out of free space before we can find the upper limit for a 24-hour period. We use a protocol called GridFTP to move data back and forth between external (non-HDFS) clusters. The other sites we transfer with use niche software you probably haven't heard of (Castor, DPM, and dCache) because, well, it's niche software. I have no available data on HDFS-S3 systems, but I'd again claim it's mostly a function of the amount of hardware you throw at it and the size of your network pipes. There are currently 182 datanodes; 180 are traditional ones of 3TB and 2 are big honking RAID arrays of 40TB. Transfers are load-balanced amongst ~7 GridFTP servers, each of which has a 1Gbps connection. GridFTP is optimised for high-bandwidth network connections, with negotiated packet sizes and multiple parallel TCP connections, so when TCP's congestion control backs off after a dropped packet, only a fraction of the transfer is affected. It is probably best-in-class for long-haul transfers over the big university backbones, where someone else pays for your traffic. You would be very hard pressed to get even close to that on any other protocol. I have no data on S3 transfers other than hearsay:
* Write time to S3 can be slow, as it doesn't return until the data is persisted somewhere. That's a better guarantee than a POSIX write operation.
* You have to rely on other people on your rack not wanting all the traffic for themselves. That's an EC2 API issue: you don't get to request/buy bandwidth to/from S3.
One thing to remember is that if you bring up a Hadoop cluster on any virtual server farm, disk IO is going to be way below physical IO rates. Even when the data is in HDFS, it will be slower to get at than dedicated high-RPM SCSI or SATA storage.
Hadoop setup questions
Good morning everyone, I have a question about the correct setup for Hadoop. I have 14 Dell computers in a lab. Each is connected to the internet and each is independent of the others. All run CentOS. Logins are handled by NIS. If UserA logs into the master and starts the daemons, and UserB logs into the master and wants to run a job while the daemons from UserA are still running, the following error occurs: copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=UserB, access=WRITE, inode=user:UserA:supergroup:rwxr-xr-x What needs to be changed to allow UserB-UserZ to run their jobs? Does there need to be a local user that everyone logs into and runs from? Should Hadoop be run on an actual cluster instead of independent computers? Any idea what the correct configuration settings are to allow this? I followed Ravi Phulari's suggestions: http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) These allowed me to get Hadoop running on the 14 computers when I log in, and everything works fine, thank you Ravi. The problem occurs when additional people attempt to run jobs simultaneously. Thank you, Brian
Re: stable version
The particular problem I am having is this one: https://issues.apache.org/jira/browse/HADOOP-2669 I am observing it in version 19. Could anybody confirm that it has been fixed in 18, as Jira claims? I am wondering why the bug fix for this problem might have been committed to the 18 branch but not to 19. If it was committed to both, then perhaps the problem was not completely solved and downgrading to 18 will not help me. Vadim On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote: Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Finding small subset in very large dataset
Hi, Let's say the smaller subset has name A. It is a relatively small collection of 100 000 entries (could also be only 100), with nearly no payload as value. Collection B is a big collection with 10 000 000 entries (each key of A also exists in collection B), where the value for each key is relatively big (> 100 KB). For all the keys in A, I need to get the corresponding value from B and collect it in the output. - I can do this by reading in both files, and on the reduce step, doing my computations and collecting only those keys which are in both A and B. The map phase, however, will take very long, as all the key/value pairs of collection B need to be sorted (and each key's value is > 100 KB) at the end of the map phase, which is overkill if A is very small. What I would need is an option to somehow compute the intersection first (a mapper only on keys, then a reduce function based only on keys and not the corresponding values, which collects the keys I want to take), and then running over the map input and filtering the output collector or the input based on the results from that reduce phase. Or is there another, faster way? Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make it more complicated, so I want to avoid that route. (This is similar to the approach I described above, just a manual one.) Thanks, Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: stable version
Vadim Zaliva wrote: The particular problem I am having is this one: https://issues.apache.org/jira/browse/HADOOP-2669 I am observing it in version 19. Could anybody confirm that it has been fixed in 18, as Jira claims? I am wondering why the bug fix for this problem might have been committed to the 18 branch but not to 19. If it was committed to both, then perhaps the problem was not completely solved and downgrading to 18 will not help me. If you read through the comments, you will see that the root cause was never found. The patch just fixes one of the suspects. If you are still seeing this, please file another jira and link it to HADOOP-2669. How easy is it for you to reproduce this? I guess one of the reasons for the incomplete diagnosis is that it is not simple to reproduce. Raghu. Vadim On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote: Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: Finding small subset in very large dataset
Are the keys in collection B unique? If so, I would like to try this approach: for each key/value pair of collection B, make a file out of it, with the file name given by the MD5 hash of the key and the value as its content, and then store all these files into a HAR archive. The HAR archive will create an index for you over the keys. Now you can iterate over collection A, take the MD5 hash of each key, and look the file up in the archive (to get the value). On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ tbr...@blue.lu wrote: Hi, Let's say the smaller subset has name A. It is a relatively small collection of 100 000 entries (could also be only 100), with nearly no payload as value. Collection B is a big collection with 10 000 000 entries (each key of A also exists in collection B), where the value for each key is relatively big (> 100 KB). For all the keys in A, I need to get the corresponding value from B and collect it in the output. - I can do this by reading in both files, and on the reduce step, doing my computations and collecting only those keys which are in both A and B. The map phase, however, will take very long, as all the key/value pairs of collection B need to be sorted (and each key's value is > 100 KB) at the end of the map phase, which is overkill if A is very small. What I would need is an option to somehow compute the intersection first (a mapper only on keys, then a reduce function based only on keys and not the corresponding values, which collects the keys I want to take), and then running over the map input and filtering the output collector or the input based on the results from that reduce phase. Or is there another, faster way? Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make it more complicated, so I want to avoid that route. (This is similar to the approach I described above, just a manual one.) Thanks, Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
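[Editor's note: the key-to-filename step in Amit's suggestion could be as simple as this hedged sketch; hex encoding is an assumption, any stable encoding of the digest works:]

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Derive a stable archive file name from a record key, per the idea above.
    public static String md5FileName(byte[] key) throws NoSuchAlgorithmException {
      byte[] digest = MessageDigest.getInstance("MD5").digest(key);
      StringBuilder name = new StringBuilder(2 * digest.length);
      for (byte b : digest) {
        name.append(String.format("%02x", b)); // two lowercase hex chars per byte
      }
      return name.toString();
    }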
can't edit the file that mounted by fuse_dfs by editor
Hey all, I was trying to edit a file mounted by fuse_dfs with the vi editor, but the contents could not be saved. The command is the following: [had...@vm-centos-5-shu-4 src]$ vi /mnt/dfs/test.txt The error message from the system log (/var/log/messages) is the following: Feb 12 09:53:48 VM-CentOS-5-SHU-4 fuse_dfs: ERROR: could not connect open file fuse_dfs.c:1340 I am using Hadoop 0.19.0 and fuse-dfs version 26 with CentOS 5.2. Does anyone have an idea as to what could be wrong? Thanks! zhuweimin
Re: Finding small subset in very large dataset
I don't see why a HAR archive needs to be involved. You can use a MapFile to create a scannable index over a SequenceFile and do lookups that way. But if A is small enough to fit in RAM, then there is a much simpler way: write it out to a file and disseminate it to all mappers via the DistributedCache. They then each read the entire A set into a HashSet or other data structure during configure(), before they scan through their slices of B. They then emit only the B values which hit in A. This is called a map-side join (see the sketch after this message). If you don't care about sorted ordering of your results, you can then disable the reducers entirely. Hive already supports this behavior, but you have to explicitly tell it to enable map-side joins for each query, because only you know ahead of time that one data set is small enough. If your A set doesn't fit in RAM, you'll need to get more creative. One possibility is to do the same thing as above, but instead of reading all of A into memory, use a hash function to squash the keys from A into some bounded amount of RAM. For example, allocate yourself a 256 MB bitvector; for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1, it may match an A key; if it's 0, it definitely does not match an A key. Send each potential match to the reducer. Send all the A keys to the reducer as well, where the precise joining will occur. (Note: this is effectively the same thing as a Bloom filter.) This will send much less data to each reducer and should see better throughput. - Aaron On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel amitchan...@gmail.com wrote: Are the keys in collection B unique? If so, I would like to try this approach: for each key/value pair of collection B, make a file out of it, with the file name given by the MD5 hash of the key and the value as its content, and then store all these files into a HAR archive. The HAR archive will create an index for you over the keys. Now you can iterate over collection A, take the MD5 hash of each key, and look the file up in the archive (to get the value). On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ tbr...@blue.lu wrote: Hi, Let's say the smaller subset has name A. It is a relatively small collection of 100 000 entries (could also be only 100), with nearly no payload as value. Collection B is a big collection with 10 000 000 entries (each key of A also exists in collection B), where the value for each key is relatively big (> 100 KB). For all the keys in A, I need to get the corresponding value from B and collect it in the output. - I can do this by reading in both files, and on the reduce step, doing my computations and collecting only those keys which are in both A and B. The map phase, however, will take very long, as all the key/value pairs of collection B need to be sorted (and each key's value is > 100 KB) at the end of the map phase, which is overkill if A is very small. What I would need is an option to somehow compute the intersection first (a mapper only on keys, then a reduce function based only on keys and not the corresponding values, which collects the keys I want to take), and then running over the map input and filtering the output collector or the input based on the results from that reduce phase. Or is there another, faster way? Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make it more complicated, so I want to avoid that route. (This is similar to the approach I described above, just a manual one.) Thanks, Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
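[Editor's note: a hedged sketch of the map-side join Aaron describes, against the old org.apache.hadoop.mapred API. The cache-file format (one A key per line) and all class and path names are illustrative assumptions:]

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MapSideJoinMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      private final Set<String> aKeys = new HashSet<String>();

      @Override
      public void configure(JobConf conf) {
        try {
          // The A file was shipped with DistributedCache.addCacheFile(...)
          // at job-submission time; read it once per task, in configure().
          Path[] cached = DistributedCache.getLocalCacheFiles(conf);
          BufferedReader in =
              new BufferedReader(new FileReader(cached[0].toString()));
          String line;
          while ((line = in.readLine()) != null) {
            aKeys.add(line.trim());
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("could not load set A from the cache", e);
        }
      }

      public void map(Text key, Text value, OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        // Emit only the B records whose key hits in A; with reducers disabled
        // (conf.setNumReduceTasks(0)) this is the whole join.
        if (aKeys.contains(key.toString())) {
          output.collect(key, value);
        }
      }
    }

[The bitvector variant would replace the HashSet with the hashed membership test Aaron describes, keeping everything else the same.]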
Reducer Out of Memory
Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just 1.7GB of memory. The input size is 50GB and, as a result, my mapper splits it up automatically into 786 map tasks. This runs fine. However, I am setting the number of reduce tasks to 18, and this is where I get a Java heap out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
Re: Reducer Out of Memory
Maybe you need to allocate more JVM memory, using the parameter -Xmx1024m. On Thu, Feb 12, 2009 at 10:56 AM, Kris Jirapinyo kjirapi...@biz360.com wrote: Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just 1.7GB of memory. The input size is 50GB and, as a result, my mapper splits it up automatically into 786 map tasks. This runs fine. However, I am setting the number of reduce tasks to 18, and this is where I get a Java heap out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
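[Editor's note: for reference, each map/reduce task runs in its own child JVM, whose options come from mapred.child.java.opts. A hedged sketch; the value is only an example:]

    import org.apache.hadoop.mapred.JobConf;

    public class ChildHeap {
      public static JobConf configure(JobConf conf) {
        // Applies to every child task JVM; see the follow-ups below for why
        // 1024m may not fit on a 1.7 GB node running two tasks at once.
        conf.set("mapred.child.java.opts", "-Xmx768m");
        return conf;
      }
    }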
Re: Reducer Out of Memory
Darn that send button. Anyway, I was wondering if my understanding is correct: there will only be exactly as many output files as the number of reduce tasks I set. Thus, in my output directory from the reducer, I should always see only 18 files. However, if my understanding is correct, then when I call output.collect() in my reducer, does it only get flushed at the end, when that particular reduce task finishes? If that is the case, then it does seem like, as my input grows, 18 reducers will not be able to handle the sheer volume of my data, as the collector will keep having to hold more and more data. So I guess this is the question: do I have to keep increasing the number of reduce tasks so that each reducer takes a smaller bite out of the chunk? Thus, if I'm running out of Java heap space and I don't want to add more nodes, I need to set my reduce task count to, say, 36, etc.? It just seems like I'm missing something. Of course, I could always add more nodes or upgrade to a larger instance so I get more memory, but that's the obvious solution (I just hope it's not the only solution). I guess what I'm saying is that I thought the reducer would be smart enough to know that it's taking too big a bite out of the whole chunk (like the mapper) and readjust itself, as I don't really care how many output files I get in the end, just that the results from the reducer stay under one directory. On Wed, Feb 11, 2009 at 6:56 PM, Kris Jirapinyo kjirapi...@biz360.com wrote: Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just 1.7GB of memory. The input size is 50GB and, as a result, my mapper splits it up automatically into 786 map tasks. This runs fine. However, I am setting the number of reduce tasks to 18, and this is where I get a Java heap out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
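[Editor's note: the framework will not resize the reduce count on its own; if smaller reduce partitions are wanted, the count is set explicitly at job setup. A hedged fragment; 36 is only an example:]

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCount {
      public static void configure(JobConf conf) {
        // More reduce tasks means smaller partitions per task (and more
        // output files under the same output directory).
        conf.setNumReduceTasks(36);
      }
    }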
Re: Reducer Out of Memory
I tried that, but with 1.7GB, that will not allow me to run 1 mapper and 1 reducer concurrently (I think when you do -Xmx1024m it tries to reserve that much physical memory?). Thus, to be safe, I set it to -Xmx768m. The error I get when I do 1024m is this:
java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2079)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 10 more
On Wed, Feb 11, 2009 at 7:02 PM, Rocks Lei Wang beyiw...@gmail.com wrote: Maybe you need to allocate more JVM memory, using the parameter -Xmx1024m. On Thu, Feb 12, 2009 at 10:56 AM, Kris Jirapinyo kjirapi...@biz360.com wrote: Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just 1.7GB of memory. The input size is 50GB and, as a result, my mapper splits it up automatically into 786 map tasks. This runs fine. However, I am setting the number of reduce tasks to 18, and this is where I get a Java heap out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
Re: Hadoop setup questions
bjday wrote: Good morning everyone, I have a question about the correct setup for Hadoop. I have 14 Dell computers in a lab. Each is connected to the internet and each is independent of the others. All run CentOS. Logins are handled by NIS. If UserA logs into the master and starts the daemons, and UserB logs into the master and wants to run a job while the daemons from UserA are still running, the following error occurs: copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=UserB, access=WRITE, inode=user:UserA:supergroup:rwxr-xr-x Looks like one of your files (input or output) belongs to a different user. It seems your DFS has permissions enabled. If you don't require permissions, disable them; otherwise make sure that the input/output paths are under your permission (/user/userB is the home directory for userB). Amar What needs to be changed to allow UserB-UserZ to run their jobs? Does there need to be a local user that everyone logs into and runs from? Should Hadoop be run on an actual cluster instead of independent computers? Any idea what the correct configuration settings are to allow this? I followed Ravi Phulari's suggestions: http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) These allowed me to get Hadoop running on the 14 computers when I log in, and everything works fine, thank you Ravi. The problem occurs when additional people attempt to run jobs simultaneously. Thank you, Brian
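[Editor's note: a hedged sketch of Amar's second option (giving each user a writable home directory), run once as the HDFS superuser; the group name and user list are assumptions:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MakeHomeDirs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (String user : args) {                // e.g. UserB UserC ... UserZ
          Path home = new Path("/user/" + user);
          fs.mkdirs(home);                        // owned by the caller at first
          fs.setOwner(home, user, "supergroup");  // hand it over to the user
        }
      }
    }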
Re: Hadoop setup questions
Like Amar said, try adding
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml), restart your daemons, and give it a whirl. cheers, -jw On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat ama...@yahoo-inc.com wrote: bjday wrote: Good morning everyone, I have a question about the correct setup for Hadoop. I have 14 Dell computers in a lab. Each is connected to the internet and each is independent of the others. All run CentOS. Logins are handled by NIS. If UserA logs into the master and starts the daemons, and UserB logs into the master and wants to run a job while the daemons from UserA are still running, the following error occurs: copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=UserB, access=WRITE, inode=user:UserA:supergroup:rwxr-xr-x Looks like one of your files (input or output) belongs to a different user. It seems your DFS has permissions enabled. If you don't require permissions, disable them; otherwise make sure that the input/output paths are under your permission (/user/userB is the home directory for userB). Amar What needs to be changed to allow UserB-UserZ to run their jobs? Does there need to be a local user that everyone logs into and runs from? Should Hadoop be run on an actual cluster instead of independent computers? Any idea what the correct configuration settings are to allow this? I followed Ravi Phulari's suggestions: http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) These allowed me to get Hadoop running on the 14 computers when I log in, and everything works fine, thank you Ravi. The problem occurs when additional people attempt to run jobs simultaneously. Thank you, Brian
Re: Loading native libraries
I also have the same problem. It would be wonderful if someone had some info about this. Rasit 2009/2/10 Mimi Sun m...@rapleaf.com: I see UnsatisfiedLinkError. Also, I'm calling System.getProperty("java.library.path") in the reducer and logging it. The only thing that prints out is ...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32 I'm using Cascading, not sure if that affects anything. - Mimi On Feb 10, 2009, at 11:40 AM, Arun C Murthy wrote: On Feb 10, 2009, at 11:06 AM, Mimi Sun wrote: Hi, I'm new to Hadoop and I'm wondering what the recommended method is for using native libraries in mapred jobs. I've tried the following separately: 1. set LD_LIBRARY_PATH in .bashrc 2. set LD_LIBRARY_PATH and JAVA_LIBRARY_PATH in hadoop-env.sh 3. set -Djava.library.path=... for mapred.child.java.opts For what you are trying (i.e. given that the JNI libs are present on all machines at a constant path), setting -Djava.library.path for the child task via mapred.child.java.opts should work. What are you seeing? Arun 4. change bin/hadoop to include $LD_LIBRARY_PATH in addition to the path it generates: HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$LD_LIBRARY_PATH:$JAVA_LIBRARY_PATH" 5. drop the .so files I need into hadoop/lib/native/... 1~3 didn't work; 4 and 5 did, but they seem to be hacks. I also read that I can do this using DistributedCache, but that seems to be extra work for loading libraries that are already present on each machine. (I'm using the JNI libs for Berkeley DB.) It seems that there should be a way to configure java.library.path for mapred jobs. Perhaps bin/hadoop should make use of LD_LIBRARY_PATH? Thanks, - Mimi -- M. Raşit ÖZDAŞ
Re: Loading native libraries
On Feb 10, 2009, at 12:24 PM, Mimi Sun wrote: I see UnsatisfiedLinkError. Also, I'm calling System.getProperty("java.library.path") in the reducer and logging it. The only thing that prints out is ...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32 I'm using Cascading, not sure if that affects anything. Hmm... that's odd. The framework does try to pass the user provided java.library.path down to the launched JVM. I assume your mapred.child.java.opts looks something like -Xmx<heapsize> -Djava.library.path=<path>? Arun
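[Editor's note: concretely, a hedged sketch of the opts string Arun describes; the heap size and library path are placeholders:]

    import org.apache.hadoop.mapred.JobConf;

    public class NativeLibOpts {
      public static void configure(JobConf conf) {
        // One string carries every child-JVM option; setting only the
        // library path would silently drop any -Xmx you relied on, so
        // pass both together.
        conf.set("mapred.child.java.opts",
                 "-Xmx512m -Djava.library.path=/usr/local/BerkeleyDB/lib");
      }
    }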
Re: what's going on :( ?
Hi Mark, Try to add an extra property to that file, and examine whether Hadoop recognizes it. This way you can find out whether Hadoop uses your configuration file. 2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Mark, In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020. From my understanding of the code, your fs.default.name setting should have overridden this port to be 9000. It appears your Hadoop installation has not picked up the configuration settings appropriately. You might want to see if you have any Hadoop processes running and terminate them (bin/stop-all.sh should help) and then restart your cluster with the new configuration to see if that helps. Later, Jeff On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote: Mark Kerzner wrote: Hi, why is Hadoop suddenly telling me "Retrying connect to server: localhost/127.0.0.1:8020" with this configuration:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
Shouldn't this be <value>hdfs://localhost:9001</value>? Amar
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
and both the http://localhost:50070/dfshealth.jsp and http://localhost:50030/jobtracker.jsp links work fine? Thank you, Mark -- M. Raşit ÖZDAŞ
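[Editor's note: Rasit's probe takes only a few lines; a hedged sketch that prints what Hadoop actually resolved, run with the conf directory on the classpath:]

    import org.apache.hadoop.conf.Configuration;

    public class ConfProbe {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // If this does not print hdfs://localhost:9000, the hadoop-site.xml
        // Mark edited is not the one on this process's classpath.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
      }
    }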