Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Nathan Marz
You can also try decreasing the replication factor for the  
intermediate files between jobs. This will make writing those files  
faster.
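For instance, a minimal sketch of that idea against the old (0.19-era) mapred API; MyIntermediateJob, the output path, and the value 2 are placeholders, not recommendations:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public static RunningJob runIntermediate(Path out) throws IOException {
  JobConf conf = new JobConf(MyIntermediateJob.class); // placeholder job class
  // Files written by this job's tasks are created with replication 2
  // instead of the cluster default, which speeds up writing them.
  conf.setInt("dfs.replication", 2);
  FileOutputFormat.setOutputPath(conf, out);
  return JobClient.runJob(conf);
}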


On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:


Hi,
I am by far not a Hadoop expert, but I think you cannot start a Map task until the previous Reduce is finished. That means you probably have to store the Map output to disk first (because a) it may not fit into memory and b) you would risk data loss if the system crashed).

As for job chaining, you can check the JobControl class (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html).



Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702
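For what it's worth, a minimal sketch of chaining two dependent jobs with JobControl; job1Conf and job2Conf stand for JobConfs you have already configured, with the second job's input path pointing at the first job's output path:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public static void runChain(JobConf job1Conf, JobConf job2Conf) throws Exception {
  Job job1 = new Job(job1Conf);
  Job job2 = new Job(job2Conf);
  job2.addDependingJob(job1);           // job2 is only submitted after job1 succeeds

  JobControl control = new JobControl("chained-jobs");
  control.addJob(job1);
  control.addJob(job2);

  Thread runner = new Thread(control);  // JobControl implements Runnable
  runner.start();
  while (!control.allFinished()) {
    Thread.sleep(5000);                 // poll until both jobs are done
  }
  control.stop();
}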

Regards,
Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote:


hi everyone,

I have to chain multiple MapReduce jobs (actually 2 to 4 jobs); each of the jobs depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to disk. Does anyone have any ideas on how to do this?

Thanx.





--
http://blog.lukas-vlcek.com/




Unable to access job details

2009-03-20 Thread Nathan Marz
Sometimes I am unable to access a job's details and instead only see the error below. I am seeing this on the 0.19.2 branch.


HTTP ERROR: 500

Internal Server Error

RequestURI=/jobdetails.jsp

Powered by Jetty://

Does anyone know the cause of this?


Secondary sorting

2009-03-18 Thread Nathan Marz
Does some of the logic of secondary sorting occur during the shuffle phase? I am seeing markedly slower copy rates during the shuffle with a job that has secondary sorting.
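For reference, secondary sorting in the old mapred API is usually configured as sketched below; the class names (CompositeKey, NaturalKeyPartitioner, CompositeKeyComparator, NaturalKeyGroupingComparator) are placeholders for your own implementations. The extra key comparisons run during the reduce-side merge, which overlaps the copy phase, so a heavier comparator may show up as slower copy rates.

import org.apache.hadoop.mapred.JobConf;

// A sketch only; all of the referenced classes are placeholders.
JobConf conf = new JobConf(MySecondarySortJob.class);
conf.setMapOutputKeyClass(CompositeKey.class);                 // natural key + secondary field
conf.setPartitionerClass(NaturalKeyPartitioner.class);         // partition on the natural key only
conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);            // sort by natural key, then secondary field
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class); // one reduce() call per natural key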


Running 0.19.2 branch in production before release

2009-03-03 Thread Nathan Marz
I would like to get the community's opinion on this. Do you think it's  
safe to run the unreleased 0.19.2 branch in production? Or do you  
recommend sticking with 0.19.1 for production use? There are some bug  
fixes in 0.19.2 which we would like to take advantage of although they  
are not blocking issues for us. 


Mappers become less utilized as time goes on?

2009-03-03 Thread Nathan Marz
I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks and 72 reducers. There are 304 map task slots in the cluster. When the job starts, it runs 304 map tasks at a time. As time goes on, the number of map tasks run concurrently drops. For at least half of the execution, exactly 152 mappers were run at a time. Towards the end, when there were only 100 or so tasks remaining, the number of concurrent mappers quickly fell to 2 at a time, bringing the end of the map phase to a crawl. This was the only job running on the cluster. Has anyone else seen behavior like this?




Re: Mappers become less utilized as time goes on?

2009-03-03 Thread Nathan Marz

Nope... and there were no failed tasks.


On Mar 3, 2009, at 5:16 PM, Runping Qi wrote:


Were task Trackers black-listed?


On Tue, Mar 3, 2009 at 3:25 PM, Nathan Marz nat...@rapleaf.com  
wrote:


I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks and 72 reducers. There are 304 map task slots in the cluster. When the job starts, it runs 304 map tasks at a time. As time goes on, the number of map tasks run concurrently drops. For at least half of the execution, exactly 152 mappers were run at a time. Towards the end, when there were only 100 or so tasks remaining, the number of concurrent mappers quickly fell to 2 at a time, bringing the end of the map phase to a crawl. This was the only job running on the cluster. Has anyone else seen behavior like this?






Shuffle phase

2009-02-26 Thread Nathan Marz
Do the reducers batch copy map outputs from a machine? That is, if a  
machine M has 15 intermediate map outputs destined for machine R, will  
machine R copy the intermediate outputs one at a time or all at once? 
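As far as I know, the copies are driven by a pool of fetch threads in each reducer rather than being batched per source host; the size of that pool is the mapred.reduce.parallel.copies setting (default 5). A minimal, purely illustrative sketch of raising it on a job:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);            // MyJob is a placeholder
conf.setInt("mapred.reduce.parallel.copies", 10);   // copier threads per reduce (default 5)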


Re: FAILED_UNCLEAN?

2009-02-25 Thread Nathan Marz
This is on Hadoop 0.19.1. The first time I saw it happen, the job was  
hung. That is, 5 map tasks were running, but looking at each task  
there was the FAILED_UNCLEAN task attempt and no other task attempts.  
I reran it, the job failed immediately, and some of the tasks had FAILED_UNCLEAN.


There is one job that runs in parallel with this job, but it's of the  
same priority. The other job had failed when the job I'm describing  
got hung.



On Feb 24, 2009, at 10:46 PM, Amareshwari Sriramadasu wrote:


Nathan Marz wrote:
I have a large job operating on over 2 TB of data, with about 5  
input splits. For some reason (as yet unknown), tasks started  
failing on two of the machines (which got blacklisted). 13 mappers  
failed in total. Of those 13, 8 of the tasks were able to execute  
on another machine without any issues. 5 of the tasks *did not* get  
re-executed on another machine, and their status is marked as  
FAILED_UNCLEAN. Anyone have any idea what's going on? Why isn't  
Hadoop running these tasks on other machines?


Has the job failed, been killed, or succeeded when you see this situation? Once the job completes, the unclean attempts will not get scheduled.
If not, are there other jobs of higher priority running at the same time preventing the cleanups from being launched?

What version of Hadoop are you using? The latest trunk?

Thanks
Amareshwari

Thanks,
Nathan Marz








Testing with Distributed Cache

2009-02-10 Thread Nathan Marz
I have some unit tests which run MapReduce jobs and test the inputs/outputs in standalone mode. I recently started using DistributedCache in one of these jobs, but now my tests fail with errors such as:


Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///tmp/file.data
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:70)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:472)
    at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:676)



Does anyone know of a way to get DistributedCache working in a test  
environment?
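One workaround worth trying: qualify the cache path against whatever filesystem is actually in effect (the local one in standalone mode) instead of passing a bare hdfs:/// path. A minimal sketch, assuming the 0.19-era API and a test file at /tmp/file.data:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public static void addTestCacheFile(JobConf conf) throws IOException {
  // In standalone mode fs.default.name is file:///, so this yields a
  // fully qualified file: URI instead of an incomplete hdfs:/// one.
  Path cacheFile = new Path("/tmp/file.data").makeQualified(FileSystem.get(conf));
  DistributedCache.addCacheFile(cacheFile.toUri(), conf);
}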


Backing up HDFS?

2009-02-09 Thread Nathan Marz
How do people back up their data that they keep on HDFS? We have many  
TB of data which we need to get backed up but are unclear on how to do  
this efficiently/reliably.


Re: Control over max map/reduce tasks per job

2009-02-03 Thread Nathan Marz
Another use case for per-job task limits is being able to use every  
core in the cluster on a map-only job.




On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:


Chris,

For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node).  However, the other way would work as well (on a 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task.

I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster).

JG




-Original Message-
From: Chris K Wensel [mailto:ch...@wensel.net]
Sent: Tuesday, February 03, 2009 11:34 AM
To: core-user@hadoop.apache.org
Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan

Are you looking to limit the total number of concurrent mappers/reducers a single job can consume cluster wide, or limit the number per node?

That is, you have X mappers/reducers, but can only allow N mappers/reducers to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having
trouble
keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have drastically different patterns.  I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough so as not to starve the Datanode and Regionserver.



I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this.  However, in talking with other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/






Re: Control over max map/reduce tasks per job

2009-02-03 Thread Nathan Marz
This is a great idea. For me, this is related to: https://issues.apache.org/jira/browse/HADOOP-5160 
. Being able to set the number of tasks per machine on a job by job  
basis would allow me to solve my problem in a different way. Looking  
at the Hadoop source, it's also probably simpler than changing how  
Hadoop schedules tasks.





On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:


Chris,

For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node).  However, the other way would work as well (on a 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task.

I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster).

JG




-Original Message-
From: Chris K Wensel [mailto:ch...@wensel.net]
Sent: Tuesday, February 03, 2009 11:34 AM
To: core-user@hadoop.apache.org
Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan

Are you looking to limit the total number of concurrent mappers/reducers a single job can consume cluster wide, or limit the number per node?

That is, you have X mappers/reducers, but can only allow N mappers/reducers to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having
trouble
keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have drastically different patterns.  I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough so as not to starve the Datanode and Regionserver.



I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this.  However, in talking with other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/






Re: How does Hadoop choose machines for Reducers?

2009-01-30 Thread Nathan Marz
This is a huge problem for my application. I tried setting mapred.tasktracker.reduce.tasks.maximum to 1 in the job's JobConf, but that didn't have any effect. I'm using a custom output format and it's essential that Hadoop distribute the reduce tasks to make use of all the machines, as there is contention when multiple reduce tasks run on one machine. Since my number of reduce tasks is guaranteed to be less than the number of machines in the cluster, there's no reason for Hadoop not to make use of the full cluster.


Does anyone know of a way to force Hadoop to distribute reduce tasks  
evenly across all the machines?



On Jan 30, 2009, at 7:32 AM, jason hadoop wrote:

Hadoop just distributes to the available reduce execution slots. I don't believe it pays attention to what machine they are on.
I believe the plan is to take data locality into account in the future (i.e. distribute tasks to machines that are considered more topologically close to their input split first), but I don't think this is available to most users.



On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz nat...@rapleaf.com  
wrote:


I have a MapReduce application in which I configure 16 reducers to run on 15 machines. My mappers output exactly 16 keys, IntWritable's from 0 to 15. However, only 12 out of the 15 machines are used to run the 16 reducers (4 machines have 2 reducers running on each). Is there a way to get Hadoop to use all the machines for reducing?





How does Hadoop choose machines for Reducers?

2009-01-29 Thread Nathan Marz
I have a MapReduce application in which I configure 16 reducers to run  
on 15 machines. My mappers output exactly 16 keys, IntWritable's from  
0 to 15. However, only 12 out of the 15 machines are used to run the  
16 reducers (4 machines have 2 reducers running on each). Is there a  
way to get Hadoop to use all the machines for reducing?


Unusual Failure of jobs

2008-12-22 Thread Nathan Marz
I have been experiencing some unusual behavior from Hadoop recently.  
When trying to run a job, some of the tasks fail with:


java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)


Not all the tasks fail, but enough tasks fail such that the job fails.  
Unfortunately, there are no further logs for these tasks. Trying to  
retrieve the logs produces:


HTTP ERROR: 410

Failed to retrieve stdout log for task:  
attempt_200811101232_0218_m_01_0


RequestURI=/tasklog


It seems like the tasktracker isn't able to even start the tasks on  
those machines. Has anyone seen anything like this before?




We're looking for Amazing Software Engineers (+ interns):
http://business.rapleaf.com/careers.html

The Rapleaf Bailout Plan - Send a qualified referral (resume) and we
will award you a $10,007 bailout package if we hire that person.



_temporary directories not deleted

2008-11-04 Thread Nathan Marz

Hello all,

Occasionally when running jobs, Hadoop fails to clean up the _temporary directories it has left behind. This only appears to happen when a task is killed (e.g. as part of speculative execution), and the data that task has outputted so far is not cleaned up. Is this a known issue in Hadoop? Is the data from that task guaranteed to be duplicate data of what was outputted by another task? Is it safe to just delete this directory without worrying about losing data?
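If you do decide to remove a leftover _temporary directory by hand once the job that wrote it has finished, a minimal sketch, assuming the default FileOutputFormat layout where _temporary sits directly under the job's output path:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public static void removeLeftoverTemporary(Configuration conf, Path jobOutputDir) throws IOException {
  Path tmp = new Path(jobOutputDir, "_temporary");
  FileSystem fs = jobOutputDir.getFileSystem(conf);
  if (fs.exists(tmp)) {
    fs.delete(tmp, true);   // recursively delete the abandoned side-effect files
  }
}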


Thanks,
Nathan Marz
Rapleaf


LeaseExpiredException and too many xceiver

2008-10-31 Thread Nathan Marz

Hello,

We are seeing some really bad errors on our Hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errors. In the namenode logs, we see a ton of errors like:


2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$
org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$
    at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166)
    at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
    at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



In the datanode logs, we see a ton of errors like:

2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$
of concurrent xcievers 256
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
    at java.lang.Thread.run(Thread.java:619)



Anyone have any ideas on what may be wrong?

Thanks,
Nathan Marz
Rapleaf


Re: LeaseExpiredException and too many xceiver

2008-10-31 Thread Nathan Marz
Looks like the exception on the datanode got truncated a little bit.  
Here's the full exception:


2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.115-50010-1225485937590, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
    at java.lang.Thread.run(Thread.java:619)


On Oct 31, 2008, at 2:49 PM, Nathan Marz wrote:


Hello,

We are seeing some really bad errors on our Hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errors. In the namenode logs, we see a ton of errors like:


2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$
org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$
    at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166)
    at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
    at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



In the datanode logs, we see a ton of errors like:

2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$
of concurrent xcievers 256
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
    at java.lang.Thread.run(Thread.java:619)



Anyone have any ideas on what may be wrong?

Thanks,
Nathan Marz
Rapleaf




Re: Turning off FileSystem statistics during MapReduce

2008-10-06 Thread Nathan Marz
We see this on Maps and only on incrementBytesRead (not on  
incrementBytesWritten). It is on HDFS where we are seeing the time  
spent. It seems that this is because incrementBytesRead is called  
every time a record is read, while incrementBytesWritten is only  
called when a buffer is spilled. We would benefit a lot from being  
able to turn this off.




On Oct 3, 2008, at 6:19 PM, Arun C Murthy wrote:


Nathan,

On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote:


Hello,

We have been doing some profiling of our MapReduce jobs, and we are seeing that about 20% of our jobs' time is spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off?




This is interesting... could you provide more details? Are you seeing this on Maps or Reduces? Which FileSystem exhibited this, i.e. HDFS or LocalFS? Any details about your application?


To answer your original question - no, there isn't a way to disable  
this. However, if this turns out to be a systemic problem we  
definitely should consider having an option to allow users to switch  
it off.


So any information you can provide helps - thanks!

Arun



Thanks,
Nathan Marz
Rapleaf







Turning off FileSystem statistics during MapReduce

2008-10-03 Thread Nathan Marz

Hello,

We have been doing some profiling of our MapReduce jobs, and we are seeing that about 20% of our jobs' time is spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off?


Thanks,
Nathan Marz
Rapleaf



Re: LZO and native hadoop libraries

2008-10-01 Thread Nathan Marz
Yes, this is exactly what I'm seeing. To be honest, I don't know which  
LZO native library it should be looking for. The LZO install dropped  
liblzo2.la and liblzo2.a in my /usr/local/lib directory, but not a  
file with a .so extension. Hardcoding would be fine as a temporary  
solution, but I don't know what to hardcode.


Thanks,
Nathan


On Sep 30, 2008, at 8:45 PM, Amareshwari Sriramadasu wrote:


Are you seeing HADOOP-2009?

Thanks
Amareshwari
Nathan Marz wrote:
Unfortunately, setting those environment variables did not help my issue. It appears that the HADOOP_LZO_LIBRARY variable is not defined in either LzoCompressor.c or LzoDecompressor.c. Where is this variable supposed to be set?




On Sep 30, 2008, at 12:33 PM, Colin Evans wrote:


Hi Nathan,
You probably need to add the Java headers to your build path as  
well - I don't know why the Mac doesn't ship with this as a  
default setting:


export CPATH=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include
export CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include





Nathan Marz wrote:
Thanks for the help. I was able to get past my previous issue,  
but the native build is still failing. Here is the end of the log  
output:


    [exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
    [exec] mkdir .libs
    [exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
    [exec] make[2]: *** [LzoCompressor.lo] Error 1
    [exec] make[1]: *** [all-recursive] Error 1
    [exec] make: *** [all] Error 2


Any ideas?



On Sep 30, 2008, at 11:53 AM, Colin Evans wrote:


There's a patch to get the native targets to build on Mac OS X:

http://issues.apache.org/jira/browse/HADOOP-3659

You probably will need to monkey with LDFLAGS as well to get it  
to work, but we've been able to build the native libs for the  
Mac without too much trouble.



Doug Cutting wrote:

Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory.


I think he needs to first build libhadoop.so, since he appears  
to be running on OS X and we only provide Linux builds of this  
in releases.


Doug














LZO and native hadoop libraries

2008-09-30 Thread Nathan Marz
I am trying to use SequenceFiles with LZO compression outside the  
context of a MapReduce application. However, when I try to use the LZO  
codec, I get the following errors in the log:


08/09/30 11:09:56 DEBUG conf.Configuration: java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.init(Configuration.java:157)
    at com.rapleaf.formats.stream.TestSequenceFileStreams.setUp(TestSequenceFileStreams.java:22)
    at junit.framework.TestCase.runBare(TestCase.java:125)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:36)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)

08/09/30 11:09:56 DEBUG security.UserGroupInformation: Unix Login: nathan,staff,_lpadmin,com.apple.sharepoint.group.1,_appserveradm,_appserverusr,admin,com.apple.access_ssh
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
08/09/30 11:09:56 DEBUG util.NativeCodeLoader: java.library.path=.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
08/09/30 11:09:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
08/09/30 11:09:56 ERROR compress.LzoCodec: Cannot load native-lzo without native-hadoop



What is the native hadoop library and how should I configure things to  
use it?
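The native hadoop library is libhadoop (libhadoop.so on Linux, shipped under lib/native in releases), and it has to be on java.library.path for the native LZO glue to load. A minimal way to check from code whether it was picked up, assuming you launch the JVM with something like -Djava.library.path=<hadoop>/lib/native/<platform>:

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
  public static void main(String[] args) {
    // Prints true only if libhadoop was found on java.library.path and loaded.
    System.out.println("native-hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
  }
}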




Thanks,

Nathan Marz
RapLeaf



Re: LZO and native hadoop libraries

2008-09-30 Thread Nathan Marz
Thanks for the help. I was able to get past my previous issue, but the  
native build is still failing. Here is the end of the log output:


    [exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
    [exec] mkdir .libs
    [exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
    [exec] make[2]: *** [LzoCompressor.lo] Error 1
    [exec] make[1]: *** [all-recursive] Error 1
    [exec] make: *** [all] Error 2


Any ideas?



On Sep 30, 2008, at 11:53 AM, Colin Evans wrote:


There's a patch to get the native targets to build on Mac OS X:

http://issues.apache.org/jira/browse/HADOOP-3659

You probably will need to monkey with LDFLAGS as well to get it to  
work, but we've been able to build the native libs for the Mac  
without too much trouble.



Doug Cutting wrote:

Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory.


I think he needs to first build libhadoop.so, since he appears to  
be running on OS X and we only provide Linux builds of this in  
releases.


Doug






Re: LZO and native hadoop libraries

2008-09-30 Thread Nathan Marz
Unfortunately, setting those environment variables did not help my issue. It appears that the HADOOP_LZO_LIBRARY variable is not defined in either LzoCompressor.c or LzoDecompressor.c. Where is this variable supposed to be set?




On Sep 30, 2008, at 12:33 PM, Colin Evans wrote:


Hi Nathan,
You probably need to add the Java headers to your build path as well  
- I don't know why the Mac doesn't ship with this as a default  
setting:


export CPATH=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include
export CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include





Nathan Marz wrote:
Thanks for the help. I was able to get past my previous issue, but  
the native build is still failing. Here is the end of the log output:


    [exec] then mv -f .deps/LzoCompressor.Tpo .deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; exit 1; fi
    [exec] mkdir .libs
    [exec] gcc -DHAVE_CONFIG_H -I. -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo -I../../../../../../.. -I/Library/Java/Home//include -I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall -fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF .deps/LzoCompressor.Tpo -c /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fno-common -DPIC -o .libs/LzoCompressor.o
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
    [exec] /Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: error: syntax error before ',' token
    [exec] make[2]: *** [LzoCompressor.lo] Error 1
    [exec] make[1]: *** [all-recursive] Error 1
    [exec] make: *** [all] Error 2


Any ideas?



On Sep 30, 2008, at 11:53 AM, Colin Evans wrote:


There's a patch to get the native targets to build on Mac OS X:

http://issues.apache.org/jira/browse/HADOOP-3659

You probably will need to monkey with LDFLAGS as well to get it to  
work, but we've been able to build the native libs for the Mac  
without too much trouble.



Doug Cutting wrote:

Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory.


I think he needs to first build libhadoop.so, since he appears to  
be running on OS X and we only provide Linux builds of this in  
releases.


Doug










Custom input format getSplits being called twice

2008-09-25 Thread Nathan Marz

Hello all,

I am getting some odd behavior from Hadoop which seems like a bug. I have created a custom input format, and I am observing that my getSplits method is being called twice. Each call is on a different instance of the input format. The job, however, is only run once, using the result from the second call to getSplits. The first call receives the numSplits hint as expected, while in the second call that value is overridden to 1. I am running Hadoop in standalone mode. Does anyone know anything about this issue?


Thanks,

Nathan Marz
Rapleaf


Parameterized InputFormats

2008-06-30 Thread Nathan Marz

Hello,

Are there any plans to change the JobConf API so that it takes an  
instance of an InputFormat rather than the InputFormat class? I am  
finding the inability to properly parameterize my InputFormats to be  
very restricting. What's the reasoning behind having the class as a  
parameter rather than an instance?


-Nathan Marz
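A common workaround, for what it's worth: the framework instantiates the InputFormat through ReflectionUtils with the JobConf, so an InputFormat that implements JobConfigurable gets a configure(JobConf) callback and can read its parameters from job properties. A minimal sketch; the property name my.input.granularity and the class itself are illustrations, not anything in Hadoop:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ParameterizedInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {

  private int granularity;

  public void configure(JobConf conf) {
    // Called when the framework instantiates the format via ReflectionUtils.
    granularity = conf.getInt("my.input.granularity", 1);
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // ... build a reader using 'granularity' ...
    throw new UnsupportedOperationException("sketch only");
  }
}

// Job setup side:
//   conf.setInputFormat(ParameterizedInputFormat.class);
//   conf.setInt("my.input.granularity", 8);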