Re: Chaining multiple MapReduce jobs
You can also try decreasing the replication factor for the intermediate files between jobs. This will make writing those files faster.

On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote: Hi, by far I am not a Hadoop expert, but I think you cannot start a Map task until the previous Reduce is finished. This means you probably have to store the Map output to disk first (because (a) it may not fit into memory, and (b) you would risk data loss if the system crashes). As for job chaining, you can check the JobControl class (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html). Also, you can look at https://issues.apache.org/jira/browse/HADOOP-3702 Regards, Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote: hi everyone, I have to chain multiple MapReduce jobs (actually 2 to 4 jobs), each of which depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to the disk. Does anyone have any ideas on how to do this? Thanks. -- http://blog.lukas-vlcek.com/
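For reference, a minimal sketch of chaining two dependent jobs with the org.apache.hadoop.mapred.jobcontrol API mentioned above. The paths, class name, and replication value are only illustrative, mapper/reducer and format settings are omitted, and the output of job 1 still goes to disk between the jobs:

    import java.util.ArrayList;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        // Job 1: reads the raw input, writes to an intermediate directory.
        // (setMapperClass/setReducerClass/input-output formats omitted here.)
        JobConf conf1 = new JobConf(ChainedJobs.class);
        FileInputFormat.setInputPaths(conf1, new Path("/data/input"));
        FileOutputFormat.setOutputPath(conf1, new Path("/data/intermediate"));
        // Lower replication for the intermediate output, per the suggestion above.
        conf1.setInt("dfs.replication", 2);

        // Job 2: consumes job 1's output.
        JobConf conf2 = new JobConf(ChainedJobs.class);
        FileInputFormat.setInputPaths(conf2, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(conf2, new Path("/data/final"));

        Job job1 = new Job(conf1, new ArrayList<Job>());
        Job job2 = new Job(conf2, new ArrayList<Job>());
        job2.addDependingJob(job1); // job2 is not submitted until job1 succeeds

        JobControl control = new JobControl("chained-jobs");
        control.addJob(job1);
        control.addJob(job2);

        // JobControl.run() polls until told to stop, so run it in its own thread.
        new Thread(control).start();
        while (!control.allFinished()) {
          Thread.sleep(5000);
        }
        control.stop();
      }
    }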
Unable to access job details
Sometimes I am unable to access a job's details and instead only see the error below. I am seeing this on the 0.19.2 branch. HTTP ERROR: 500 Internal Server Error RequestURI=/jobdetails.jsp Powered by Jetty:// Does anyone know the cause of this?
Secondary sorting
Does some of the logic of secondary sorting occur during the shuffle phase? I am seeing markedly slower copy rates during the shuffle for a job that has secondary sorting.
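For context, as far as I understand it the sort comparator is used both when map outputs are sorted and spilled on the map side and again when the fetched outputs are merged on the reduce side after the copy, while the grouping comparator is applied as reduce() iterates over values. A rough sketch of how the pieces are usually wired up in the old mapred API (the composite key and comparator class names are only illustrative):

    JobConf conf = new JobConf(MyJob.class); // MyJob is illustrative
    // Map output key carries (primary, secondary) parts.
    conf.setMapOutputKeyClass(CompositeKey.class);
    // Partition on the primary part only, so all records for a primary key
    // reach the same reducer.
    conf.setPartitionerClass(PrimaryKeyPartitioner.class);
    // Full ordering (primary, then secondary) used when sorting/merging map outputs.
    conf.setOutputKeyComparatorClass(FullKeyComparator.class);
    // Group reduce() calls by the primary part only.
    conf.setOutputValueGroupingComparator(PrimaryKeyComparator.class);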
Running 0.19.2 branch in production before release
I would like to get the community's opinion on this. Do you think it's safe to run the unreleased 0.19.2 branch in production? Or do you recommend sticking with 0.19.1 for production use? There are some bug fixes in 0.19.2 which we would like to take advantage of although they are not blocking issues for us.
Mappers become less utilized as time goes on?
I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks and 72 reducers. There are 304 map task slots in the cluster. When the job starts, it runs 304 map tasks at a time. As time goes on, the number of map tasks run concurrently drops. For at least half of the execution, exactly 152 mappers were run at a time. Towards the end, when there were only 100 or so tasks remaining, the number of concurrent mappers quickly fell to 2 at a time, bringing the end of the map phase to a crawl. This was the only job running on the cluster. Has anyone else seen behavior like this?
Re: Mappers become less utilized as time goes on?
Nope... and there were no failed tasks.

On Mar 3, 2009, at 5:16 PM, Runping Qi wrote: Were task trackers blacklisted?

On Tue, Mar 3, 2009 at 3:25 PM, Nathan Marz nat...@rapleaf.com wrote: I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks and 72 reducers. There are 304 map task slots in the cluster. When the job starts, it runs 304 map tasks at a time. As time goes on, the number of map tasks run concurrently drops. For at least half of the execution, exactly 152 mappers were run at a time. Towards the end, when there were only 100 or so tasks remaining, the number of concurrent mappers quickly fell to 2 at a time, bringing the end of the map phase to a crawl. This was the only job running on the cluster. Has anyone else seen behavior like this?
Shuffle phase
Do the reducers batch copy map outputs from a machine? That is, if a machine M has 15 intermediate map outputs destined for machine R, will machine R copy the intermediate outputs one at a time or all at once?
Re: FAILED_UNCLEAN?
This is on Hadoop 0.19.1. The first time I saw it happen, the job was hung. That is, 5 map tasks were running, but looking at each task, there was the FAILED_UNCLEAN task attempt and no other task attempts. I reran it again, the job failed immediately, and some of the tasks had FAILED_UNCLEAN. There is one job that runs in parallel with this job, but it's of the same priority. The other job had failed when the job I'm describing got hung.

On Feb 24, 2009, at 10:46 PM, Amareshwari Sriramadasu wrote: Nathan Marz wrote: I have a large job operating on over 2 TB of data, with about 5 input splits. For some reason (as yet unknown), tasks started failing on two of the machines (which got blacklisted). 13 mappers failed in total. Of those 13, 8 of the tasks were able to execute on another machine without any issues. 5 of the tasks *did not* get re-executed on another machine, and their status is marked as FAILED_UNCLEAN. Anyone have any idea what's going on? Why isn't Hadoop running these tasks on other machines?

Has the job failed/been killed or succeeded when you see this situation? Once the job completes, the unclean attempts will not get scheduled. If not, are there other jobs of higher priority running at the same time, preventing the cleanups from being launched? What version of Hadoop are you using? Latest trunk? Thanks, Amareshwari

Thanks, Nathan Marz
Testing with Distributed Cache
I have some unit tests which run MapReduce jobs and test the inputs/outputs in standalone mode. I recently started using DistributedCache in one of these jobs, but now my tests fail with errors such as:

Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///tmp/file.data
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:70)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:472)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:676)

Does anyone know of a way to get DistributedCache working in a test environment?
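One workaround that may help in standalone tests (unverified against 0.19, and the path below is only illustrative) is to qualify the cache file against whatever FileSystem the test is actually using, rather than an hdfs:// URI, before handing it to DistributedCache. This uses org.apache.hadoop.filecache.DistributedCache and org.apache.hadoop.fs.{FileSystem, Path}:

    // e.g. in the test's setUp():
    JobConf conf = new JobConf();
    FileSystem fs = FileSystem.get(conf);                           // the local FS in standalone mode
    Path cacheFile = new Path("/tmp/file.data").makeQualified(fs);  // yields a file:/// URI locally
    DistributedCache.addCacheFile(cacheFile.toUri(), conf);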
Backing up HDFS?
How do people back up their data that they keep on HDFS? We have many TB of data which we need to get backed up but are unclear on how to do this efficiently/reliably.
Re: Control over max map/reduce tasks per job
Another use case for per-job task limits is being able to use every core in the cluster on a map-only job.

On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU-bound task on any given node). However, the other way would work as well (on a 10-node system, I would set the job to max 20 tasks at a time globally), but that opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement, as either would be a vast improvement for me (I can run high numbers of network-latency-bound tasks without fear of CPU-bound tasks killing the cluster). JG

-----Original Message----- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan, Are you looking to limit the total number of concurrent mappers/reducers a single job can consume cluster-wide, or limit the number per node? That is, you have X mappers/reducers, but can only allow N mappers/reducers to run at a time globally for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? Just reconciling the conversation we had last week with this thread. ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or IO bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network-latency-bound jobs; however, the large CPU-bound job means I have to keep the max maps allowed per node low enough so as not to starve the DataNode and RegionServer. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray

-- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
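For reference, the only knobs along these lines I'm aware of in current releases are the per-tasktracker slot counts, which are cluster-wide for all jobs rather than per job. They go in each node's hadoop-site.xml, for example (the values are only illustrative):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>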
Re: Control over max map/reduce tasks per job
This is a great idea. For me, this is related to https://issues.apache.org/jira/browse/HADOOP-5160. Being able to set the number of tasks per machine on a job-by-job basis would allow me to solve my problem in a different way. Looking at the Hadoop source, it's also probably simpler than changing how Hadoop schedules tasks.

On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU-bound task on any given node). However, the other way would work as well (on a 10-node system, I would set the job to max 20 tasks at a time globally), but that opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement, as either would be a vast improvement for me (I can run high numbers of network-latency-bound tasks without fear of CPU-bound tasks killing the cluster). JG

-----Original Message----- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan, Are you looking to limit the total number of concurrent mappers/reducers a single job can consume cluster-wide, or limit the number per node? That is, you have X mappers/reducers, but can only allow N mappers/reducers to run at a time globally for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? Just reconciling the conversation we had last week with this thread. ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or IO bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network-latency-bound jobs; however, the large CPU-bound job means I have to keep the max maps allowed per node low enough so as not to starve the DataNode and RegionServer. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray

-- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: How does Hadoop choose machines for Reducers?
This is a huge problem for my application. I tried setting mapred.tasktracker.reduce.tasks.maximum to 1 in the job's JobConf, but that didn't have any effect. I'm using a custom output format, and it's essential that Hadoop distribute the reduce tasks to make use of all the machines, as there is contention when multiple reduce tasks run on one machine. Since my number of reduce tasks is guaranteed to be less than the number of machines in the cluster, there's no reason for Hadoop not to make use of the full cluster. Does anyone know of a way to force Hadoop to distribute reduce tasks evenly across all the machines?

On Jan 30, 2009, at 7:32 AM, jason hadoop wrote: Hadoop just distributes to the available reduce execution slots. I don't believe it pays attention to what machine they are on. I believe the plan is to take account of data locality in the future (i.e., distribute tasks to machines that are considered more topologically close to their input split first), but I don't think this is available to most users.

On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz nat...@rapleaf.com wrote: I have a MapReduce application in which I configure 16 reducers to run on 15 machines. My mappers output exactly 16 keys, IntWritables from 0 to 15. However, only 12 out of the 15 machines are used to run the 16 reducers (4 machines have 2 reducers running on each). Is there a way to get Hadoop to use all the machines for reducing?
How does Hadoop choose machines for Reducers?
I have a MapReduce application in which I configure 16 reducers to run on 15 machines. My mappers output exactly 16 keys, IntWritables from 0 to 15. However, only 12 out of the 15 machines are used to run the 16 reducers (4 machines have 2 reducers running on each). Is there a way to get Hadoop to use all the machines for reducing?
Unusual Failure of jobs
I have been experiencing some unusual behavior from Hadoop recently. When trying to run a job, some of the tasks fail with:

java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

Not all the tasks fail, but enough tasks fail that the job fails. Unfortunately, there are no further logs for these tasks. Trying to retrieve the logs produces:

HTTP ERROR: 410 Failed to retrieve stdout log for task: attempt_200811101232_0218_m_01_0 RequestURI=/tasklog

It seems like the tasktracker isn't able to even start the tasks on those machines. Has anyone seen anything like this before?
_temporary directories not deleted
Hello all, Occasionally when running jobs, Hadoop fails to clean up the _temporary directories it has left behind. This only appears to happen when a task is killed (e.g., a speculative execution duplicate), and the data that task has output so far is not cleaned up. Is this a known issue in Hadoop? Is the data from that task guaranteed to be duplicate data of what was output by another task? Is it safe to just delete this directory without worrying about losing data? Thanks, Nathan Marz Rapleaf
LeaseExpiredException and too many xceiver
Hello, We are seeing some really bad errors on our Hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errors. In the namenode logs, we see a ton of errors like:

2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$
org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$
at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166)
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

In the datanode logs, we see a ton of errors like:

2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
at java.lang.Thread.run(Thread.java:619)

Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
Re: LeaseExpiredException and too many xceiver
Looks like the exception on the datanode got truncated a little bit. Here's the full exception:

2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.115-50010-1225485937590, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
at java.lang.Thread.run(Thread.java:619)

On Oct 31, 2008, at 2:49 PM, Nathan Marz wrote: Hello, We are seeing some really bad errors on our Hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errors. In the namenode logs, we see a ton of errors like:

2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$
org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$
at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166)
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

In the datanode logs, we see a ton of errors like:

2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
at java.lang.Thread.run(Thread.java:619)

Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
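The usual remedy suggested for the "exceeds the limit of concurrent xcievers" error is to raise the datanode limit in each datanode's hadoop-site.xml and restart the datanodes (the value 1024 below is just an example); the LeaseExpiredException errors may then be a downstream effect of writes failing once the limit is hit:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>1024</value>
    </property>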
Re: Turning off FileSystem statistics during MapReduce
We see this on Maps and only on incrementBytesRead (not on incrementBytesWritten). It is on HDFS where we are seeing the time spent. It seems that this is because incrementBytesRead is called every time a record is read, while incrementBytesWritten is only called when a buffer is spilled. We would benefit a lot from being able to turn this off.

On Oct 3, 2008, at 6:19 PM, Arun C Murthy wrote: Nathan, On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote: Hello, We have been doing some profiling of our MapReduce jobs, and we are seeing that about 20% of the time of our jobs is spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off?

This is interesting... could you provide more details? Are you seeing this on Maps or Reduces? Which FileSystem exhibited this, i.e., HDFS or LocalFS? Any details about your application? To answer your original question - no, there isn't a way to disable this. However, if this turns out to be a systemic problem, we definitely should consider having an option to allow users to switch it off. So any information you can provide helps - thanks! Arun

Thanks, Nathan Marz Rapleaf
Turning off FileSystem statistics during MapReduce
Hello, We have been doing some profiling of our MapReduce jobs, and we are seeing that about 20% of the time of our jobs is spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off? Thanks, Nathan Marz Rapleaf
Re: LZO and native hadoop libraries
Yes, this is exactly what I'm seeing. To be honest, I don't know which LZO native library it should be looking for. The LZO install dropped liblzo2.la and liblzo2.a in my /usr/local/lib directory, but not a file with a .so extension. Hardcoding would be fine as a temporary solution, but I don't know what to hardcode. Thanks, Nathan

On Sep 30, 2008, at 8:45 PM, Amareshwari Sriramadasu wrote: Are you seeing HADOOP-2009? Thanks, Amareshwari

Nathan Marz wrote: Unfortunately, setting those environment variables did not help my issue. It appears that the HADOOP_LZO_LIBRARY variable is not defined in both LzoCompressor.c and LzoDecompressor.c. Where is this variable supposed to be set?

On Sep 30, 2008, at 12:33 PM, Colin Evans wrote: Hi Nathan, You probably need to add the Java headers to your build path as well - I don't know why the Mac doesn't ship with this as a default setting: export CPATH=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include export CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include

Nathan Marz wrote: Thanks for the help. I was able to get past my previous issue, but the native build is still failing. Here is the end of the log output:

[exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
[exec] mkdir .libs
[exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
[exec] make[2]: *** [LzoCompressor.lo] Error 1
[exec] make[1]: *** [all-recursive] Error 1
[exec] make: *** [all] Error 2

Any ideas?

On Sep 30, 2008, at 11:53 AM, Colin Evans wrote: There's a patch to get the native targets to build on Mac OS X: http://issues.apache.org/jira/browse/HADOOP-3659 You probably will need to monkey with LDFLAGS as well to get it to work, but we've been able to build the native libs for the Mac without too much trouble.

Doug Cutting wrote: Arun C Murthy wrote: You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory. I think he needs to first build libhadoop.so, since he appears to be running on OS X and we only provide Linux builds of this in releases. Doug
LZO and native hadoop libraries
I am trying to use SequenceFiles with LZO compression outside the context of a MapReduce application. However, when I try to use the LZO codec, I get the following errors in the log:

08/09/30 11:09:56 DEBUG conf.Configuration: java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.init(Configuration.java:157)
at com.rapleaf.formats.stream.TestSequenceFileStreams.setUp(TestSequenceFileStreams.java:22)
at junit.framework.TestCase.runBare(TestCase.java:125)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:36)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
08/09/30 11:09:56 DEBUG security.UserGroupInformation: Unix Login: nathan,staff,_lpadmin,com.apple.sharepoint.group.1,_appserveradm,_appserverusr,admin,com.apple.access_ssh
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: java.library.path=.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
08/09/30 11:09:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
08/09/30 11:09:56 ERROR compress.LzoCodec: Cannot load native-lzo without native-hadoop

What is the native hadoop library and how should I configure things to use it? Thanks, Nathan Marz RapLeaf
Re: LZO and native hadoop libraries
Thanks for the help. I was able to get past my previous issue, but the native build is still failing. Here is the end of the log output:

[exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
[exec] mkdir .libs
[exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
[exec] make[2]: *** [LzoCompressor.lo] Error 1
[exec] make[1]: *** [all-recursive] Error 1
[exec] make: *** [all] Error 2

Any ideas?

On Sep 30, 2008, at 11:53 AM, Colin Evans wrote: There's a patch to get the native targets to build on Mac OS X: http://issues.apache.org/jira/browse/HADOOP-3659 You probably will need to monkey with LDFLAGS as well to get it to work, but we've been able to build the native libs for the Mac without too much trouble.

Doug Cutting wrote: Arun C Murthy wrote: You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory. I think he needs to first build libhadoop.so, since he appears to be running on OS X and we only provide Linux builds of this in releases. Doug
Re: LZO and native hadoop libraries
Unfortunately, setting those environment variables did not help my issue. It appears that the HADOOP_LZO_LIBRARY variable is not defined in both LzoCompressor.c and LzoDecompressor.c. Where is this variable supposed to be set?

On Sep 30, 2008, at 12:33 PM, Colin Evans wrote: Hi Nathan, You probably need to add the Java headers to your build path as well - I don't know why the Mac doesn't ship with this as a default setting: export CPATH=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include export CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include

Nathan Marz wrote: Thanks for the help. I was able to get past my previous issue, but the native build is still failing. Here is the end of the log output:

[exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
[exec] mkdir .libs
[exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
[exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
[exec] make[2]: *** [LzoCompressor.lo] Error 1
[exec] make[1]: *** [all-recursive] Error 1
[exec] make: *** [all] Error 2

Any ideas?

On Sep 30, 2008, at 11:53 AM, Colin Evans wrote: There's a patch to get the native targets to build on Mac OS X: http://issues.apache.org/jira/browse/HADOOP-3659 You probably will need to monkey with LDFLAGS as well to get it to work, but we've been able to build the native libs for the Mac without too much trouble.

Doug Cutting wrote: Arun C Murthy wrote: You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory. I think he needs to first build libhadoop.so, since he appears to be running on OS X and we only provide Linux builds of this in releases. Doug
Custom input format getSplits being called twice
Hello all, I am getting some odd behavior from Hadoop which seems like a bug. I have created a custom input format, and I am observing that my getSplits method is being called twice. Each call is on a different instance of the input format. The job, however, is only run once, using the result from the second call to getSplits. The first call receives the numSplits hint as expected, while in the second call that value is overridden to 1. I am running Hadoop in standalone mode. Does anyone know anything about this issue? Thanks, Nathan Marz Rapleaf
Parameterized InputFormats
Hello, Are there any plans to change the JobConf API so that it takes an instance of an InputFormat rather than the InputFormat class? I am finding the inability to properly parameterize my InputFormats to be very restricting. What's the reasoning behind having the class as a parameter rather than an instance? -Nathan Marz
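For what it's worth, the usual workaround with the class-based API is to carry the "constructor arguments" in the JobConf itself: getSplits() receives the JobConf directly, and I believe the framework also calls configure(JobConf) on input formats that implement JobConfigurable. A rough sketch (the property name, key/value types, and split logic are only illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class ShardedInputFormat implements InputFormat<LongWritable, Text> {

      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Parameters ride along in the job configuration instead of on an instance.
        int shardCount = job.getInt("sharded.inputformat.shard.count", numSplits);
        InputSplit[] splits = new InputSplit[shardCount];
        // ... build one split per shard (omitted in this sketch) ...
        return splits;
      }

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // The record reader can read the same properties back out of the JobConf.
        throw new UnsupportedOperationException("omitted in this sketch");
      }
    }

On the submitting side, something like: conf.setInputFormat(ShardedInputFormat.class); conf.setInt("sharded.inputformat.shard.count", 16); It is clumsier than passing an instance, but it keeps the parameters serializable with the job.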