RE: Hadoop Write Performance

2009-02-18 Thread Xavier Stevens
Raghu,

I was using 0.17.2.1, but I installed 0.18.3 a couple of days ago.  I
also separated out my secondarynamenode and jobtracker to another
machine.  In addition, my network operations people had misconfigured
some switches, which ended up being my bottleneck.

After all of that, my writer and Hadoop are working great.


-Xavier
 

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Wednesday, February 18, 2009 11:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop Write Performance


What is the Hadoop version?

You could check the log on a datanode around that time and post any
suspicious errors. For example, you can trace a particular block in the
client and datanode logs.

Most likely it is not a NameNode issue, but you can check the NameNode
log as well.

Raghu.

Xavier Stevens wrote:
 Does anyone have an expected or experienced write speed to HDFS 
 outside of Map/Reduce?  Any recommendations on properties to tweak in 
 hadoop-site.xml?
  
 Currently I have a multi-threaded writer where each thread is writing 
 to a different file.  But after a while I get this:
  
 java.io.IOException: Could not get block locations. Aborting...
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
  
 Which perhaps indicates that the namenode is overwhelmed?
  
  
 Thanks,
  
 -Xavier
 





Hadoop Performance

2009-02-17 Thread Xavier Stevens
I'm uploading a bunch of data to a new cluster, but I'm getting terrible
performance.  I'm using Hadoop 0.18.3 and my write speed never breaks
10 MB/s right now.  Does anyone know what might be wrong, or have
suggestions for alternatives to dfs -put?
 
Thanks,
 
 
-Xavier


RE: Hadoop Performance

2009-02-17 Thread Xavier Stevens
Disregard.  I discovered that this was a networking performance issue
unrelated to Hadoop.

-Xavier
 

-Original Message-
From: Xavier Stevens 
Sent: Tuesday, February 17, 2009 3:59 PM
To: core-user@hadoop.apache.org
Subject: Hadoop Performance

I'm uploading a bunch of data to a new cluster, but I'm getting terrible
performance.  I'm using Hadoop 0.18.3 and my write speed never breaks
10 MB/s right now.  Does anyone know what might be wrong, or have
suggestions for alternatives to dfs -put?
 
Thanks,
 
 
-Xavier



Hadoop Write Performance

2009-02-13 Thread Xavier Stevens
Does anyone have an expected or experienced write speed to HDFS outside
of Map/Reduce?  Any recommendations on properties to tweak in
hadoop-site.xml?
 
Currently I have a multi-threaded writer where each thread is writing to
a different file.  But after a while I get this:
 
java.io.IOException: Could not get block locations. Aborting...
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
 
Which perhaps indicates that the namenode is overwhelmed?
 
 
Thanks,
 
-Xavier
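
For reference, a minimal sketch of the kind of one-file-per-thread writer described in this message, assuming the 0.17/0.18-era FileSystem API; the path layout, record contents, and thread count are illustrative, not taken from the thread.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: each thread opens its own HDFS file and streams records to it.
public class HdfsWriterThread extends Thread {
    private final FileSystem fs;
    private final Path file;

    public HdfsWriterThread(FileSystem fs, int id) {
        this.fs = fs;
        this.file = new Path("/data/incoming/part-" + id);  // one file per thread (illustrative path)
    }

    public void run() {
        try {
            FSDataOutputStream out = fs.create(file);
            for (int i = 0; i < 100000; i++) {
                out.write(("record " + i + "\n").getBytes("UTF-8"));
            }
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        for (int t = 0; t < 4; t++) {       // illustrative thread count
            new HdfsWriterThread(fs, t).start();
        }
    }
}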


Reading Protocol Buffers as BytesWritable value

2008-12-03 Thread Xavier Stevens
I am currently trying to read back values that I previously wrote out as 
Text,BytesWritable pairs.  The key is actually Hadoop's Text writable, and 
the value is a Protocol Buffer byte array output into a BytesWritable.  Here is 
a snippet showing the output configuration.

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
SequenceFileAsBinaryOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputKeyClass(job, Text.class);
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputValueClass(job, BytesWritable.class);
FileOutputFormat.setOutputPath(job, outDataPath);

In another job I am trying to read this back in:

job.setInputFormat(org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat.class);

public static class Map extends MapReduceBase implements 
Mapper<Text,BytesWritable,Text,LongWritable> { ... }

I get an error like this:

java.io.IOException: hdfs://localhost:4000/user/myuser/step1-out/part-3.deflate not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1458)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
at org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat$SequenceFileAsBinaryRecordReader.<init>(SequenceFileAsBinaryInputFormat.java:67)
at org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat.getRecordReader(SequenceFileAsBinaryInputFormat.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Am I doing something wrong here?  Or is there just some inherent problem with 
what I am trying to do?


-Xavier 
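
A hedged sketch of a simpler variant of the setup above, in case it is useful: write Text keys and the serialized protobuf bytes as BytesWritable values with plain SequenceFileOutputFormat, then read them back with SequenceFileInputFormat, copying only the used portion of the BytesWritable before parsing (its backing array can be padded). MyMessage stands in for a generated protobuf class and is purely illustrative; this is not the poster's exact configuration.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ProtoBytesJob {

    // Writing side: Text key, BytesWritable (serialized protobuf) value,
    // stored as a block-compressed SequenceFile.
    public static void configureOutput(JobConf job, Path out) {
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputPath(job, out);
    }

    // Reading side in the follow-up job; pair this mapper with
    // job.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);
    public static class Map extends MapReduceBase
            implements Mapper<Text, BytesWritable, Text, LongWritable> {
        public void map(Text key, BytesWritable value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // The backing array may be longer than the record; copy only the
            // used bytes (getBytes()/getLength() here; 0.1x releases call
            // these get() and getSize()).
            byte[] raw = new byte[value.getLength()];
            System.arraycopy(value.getBytes(), 0, raw, 0, raw.length);
            // MyMessage msg = MyMessage.parseFrom(raw);  // hypothetical generated class
            output.collect(key, new LongWritable(raw.length));
        }
    }

    private ProtoBytesJob() {}
}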


RE: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

2008-11-18 Thread Xavier Stevens
I'm still seeing this problem on a cluster using Hadoop 0.18.2.  I tried
dropping the max number of map tasks per node from 8 to 7.  I still get
the error although it's less frequent.  But I don't get the error at all
when using Hadoop 0.17.2.

Anyone have any suggestions?


-Xavier

-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Thursday, October 09, 2008 2:07 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12,
Cannot allocate memory

Thanks Alexander!!

On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov
[EMAIL PROTECTED] wrote:
 I received such errors when I overloaded the data nodes. You may increase 
 swap space or run fewer tasks.

 Alexander

 2008/10/9 Edward J. Yoon [EMAIL PROTECTED]

 Hi,

 I received the message below. Can anyone explain this?

 08/10/09 11:53:33 INFO mapred.JobClient: Task Id : task_200810081842_0004_m_00_0, Status : FAILED
 java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
 Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more

 --
 Best regards, Edward J. Yoon
 [EMAIL PROTECTED]
 http://blog.udanax.org




 --
 Best Regards
 Alexander Aristov




--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org




RE: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

2008-11-18 Thread Xavier Stevens
1) It doesn't look like I'm out of memory but it is coming really close.
2) overcommit_memory is set to 2, overcommit_ratio = 100

As for the JVM, I am using Java 1.6.

**Note of Interest**: The virtual memory I see allocated in top for each
task is more than what I am specifying in the hadoop job/site configs.

Currently each physical box has 16 GB of memory.  I see the datanode and
tasktracker using: 

              RES    VIRT
Datanode      145m   1408m
Tasktracker   206m   1439m

when idle.

So taking that into account, I do 16000 MB - (1408 + 1439) MB, which
leaves me with roughly 13150 MB.  In my old settings I was using 8 map
tasks, so 13150 / 8 is about 1640 MB per task.

My mapred.child.java.opts is -Xmx1536m, which should leave me a little
headroom.

When running, though, I see some tasks reporting 1900m.


-Xavier


-Original Message-
From: Brian Bockelman [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 18, 2008 2:42 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12,
Cannot allocate memory

Hey Xavier,

1) Are you out of memory (dumb question, but doesn't hurt to ask...)?   
What does Ganglia tell you about the node?
2) Do you have /proc/sys/vm/overcommit_memory set to 2?

Telling Linux not to overcommit memory on Java 1.5 JVMs can be very
problematic.  Java 1.5 asks for min heap size + 1 GB of reserved,
non-swap memory on Linux systems by default.  The 1 GB of reserved,
non-swap memory is used for the JIT to compile code; this bug wasn't
fixed until later Java 1.5 updates.

Brian

On Nov 18, 2008, at 4:32 PM, Xavier Stevens wrote:

 I'm still seeing this problem on a cluster using Hadoop 0.18.2.  I tried
 dropping the max number of map tasks per node from 8 to 7.  I still get
 the error although it's less frequent.  But I don't get the error at all
 when using Hadoop 0.17.2.

 Anyone have any suggestions?


 -Xavier

 -Original Message-
 From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
 Sent: Thursday, October 09, 2008 2:07 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Cannot run program bash: java.io.IOException: error=12,
 Cannot allocate memory

 Thanks Alexander!!

 On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov
 [EMAIL PROTECTED] wrote:
 I received such errors when I overloaded the data nodes. You may increase
 swap space or run fewer tasks.

 Alexander

 2008/10/9 Edward J. Yoon [EMAIL PROTECTED]

 Hi,

 I received the message below. Can anyone explain this?

 08/10/09 11:53:33 INFO mapred.JobClient: Task Id : task_200810081842_0004_m_00_0, Status : FAILED
 java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
 Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more

 --
 Best regards, Edward J. Yoon
 [EMAIL PROTECTED]
 http://blog.udanax.org




 --
 Best Regards
 Alexander Aristov




 --
 Best regards, Edward J. Yoon
 [EMAIL PROTECTED]
 http://blog.udanax.org






Un-Blacklist Node

2008-08-14 Thread Xavier Stevens
Is there a way to un-blacklist a node without restarting hadoop?

Thanks,

-Xavier



RE: Un-Blacklist Node

2008-08-14 Thread Xavier Stevens
I don't think this is the per-job blacklist.  The datanode is still
running, but the tasktracker isn't running on the slave machine.  Can I
just run start-mapred on that machine?

-Xavier
 

-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 14, 2008 1:43 PM
To: core-user@hadoop.apache.org
Subject: Re: Un-Blacklist Node

Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:

 Is there a way to un-blacklist a node without restarting hadoop?


Which blacklist are you talking about? Per-job blacklist of
TaskTrackers? Hadoop Daemons?

Arun





RE: java.io.IOException: Cannot allocate memory

2008-08-01 Thread Xavier Stevens
Actually, I found the problem: our operations people had enabled
overcommit on memory and restricted it to 50%... lol.  Telling them to
make it 100% fixed the problem.

-Xavier


-Original Message-
From: Taeho Kang [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 31, 2008 6:16 PM
To: core-user@hadoop.apache.org
Subject: Re: java.io.IOException: Cannot allocate memory

Are you using HadoopStreaming?

If so, then the subprocess created by the Hadoop Streaming job can take
as much memory as it needs. In that case, the system will run out of
memory and other processes (e.g. the TaskTracker) may not be able to run
properly or may even be killed by the OS.

/Taeho

On Fri, Aug 1, 2008 at 2:24 AM, Xavier Stevens
[EMAIL PROTECTED]wrote:

 We're currently running jobs on machines with around 16GB of memory with
 8 map tasks per machine.  We used to run with max heap set to 2048m.
 Since we started using version 0.17.1 we've been getting a lot of these
 errors:

 task_200807251330_0042_m_000146_0: Caused by: java.io.IOException: java.io.IOException: Cannot allocate memory
 task_200807251330_0042_m_000146_0:  at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
 task_200807251330_0042_m_000146_0:  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
 task_200807251330_0042_m_000146_0:  at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.run(Shell.java:134)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

 We haven't changed our heapsizes at all.  Has anyone else experienced 
 this?  Is there a way around it other than reducing heap sizes 
 excessively low?  I've tried all the way down to 1024m max heap and I 
 still get this error.


 -Xavier





java.io.IOException: Cannot allocate memory

2008-07-31 Thread Xavier Stevens
We're currently running jobs on machines with around 16GB of memory with
8 map tasks per machine.  We used to run with max heap set to 2048m.
Since we started using version 0.17.1 we've been getting a lot of these
errors:

task_200807251330_0042_m_000146_0: Caused by: java.io.IOException: java.io.IOException: Cannot allocate memory
task_200807251330_0042_m_000146_0:  at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
task_200807251330_0042_m_000146_0:  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
task_200807251330_0042_m_000146_0:  at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.run(Shell.java:134)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

We haven't changed our heapsizes at all.  Has anyone else experienced
this?  Is there a way around it other than reducing heap sizes
excessively low?  I've tried all the way down to 1024m max heap and I
still get this error.


-Xavier



RE: client connect as different username?

2008-06-11 Thread Xavier Stevens
This is how I've done it before:

1.) Create a hadoop user/group.  
2.) Make the local filesystem dfs directories writable by the hadoop
group and set the sticky bit.  
3.) Run hadoop as the hadoop user.
4.) Then add all of your users to the hadoop group.  

I also changed the dfs.permissions.supergroup property to "hadoop" in
$HADOOP_HOME/conf/hadoop-site.xml.

This works pretty well for us.  Hope it helps.

Cheers,

-Xavier 


-Original Message-
From: Chris Collins [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 5:18 PM
To: core-user@hadoop.apache.org
Subject: Re: client connect as different username?

The finer point to this is that in development you may be logged in as
user x and have a shared HDFS instance that a number of people are
using.  In that mode it's not practical to sudo, as you have all your
development tools set up for user x.  HDFS is set up with a single user;
what is the procedure to add users to that HDFS instance?  Surely it has
to support this?  It's really not obvious; looking in the HDFS docs that
come with the distro, nothing springs out.  The hadoop command-line tool
doesn't have anything that vaguely looks like a way to create a user.

Help is greatly appreciated.  I am sure it's somewhere blindingly
obvious.

How are other people doing this, other than sudoing to one single user
name?

Thanks

ChRiS


On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:

 The best way is to use the sudo command to execute the hadoop client.
 Does it work for you?

 Nicholas


 - Original Message 
 From: Bob Remeika [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Wednesday, June 11, 2008 12:56:14 PM
 Subject: client connect as different username?

 Apologies if this is an RTM response, but I looked and wasn't able to
 find anything concrete.  Is it possible to connect to HDFS via the
 HDFS client under a different username than the one I am currently
 logged in as?

 Here is our situation: I am user bobr on the client machine.  I need
 to add something to the HDFS cluster as the user companyuser.  Is
 this possible with the current set of APIs, or do I have to upload and
 chown?

 Thanks,
 Bob






RE: FileSystem.create

2008-05-15 Thread Xavier Stevens
I don't even get a single successful write.  I could put a Thread.sleep
after I call create to see if that works, but this seems like a hack.
Further in the stack trace, after the retry logic kicks in, I start to
get a not-replicated-yet exception.  Why does the API return me a stream
that I can't use?

-Xavier
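
For the combine-all-part-files step mentioned in the original message below, a hedged sketch using FileUtil.copyMerge, the call behind 'hadoop fs -getmerge'; the paths are illustrative and this assumes plain text part files rather than SequenceFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path jobOutput = new Path("output");                 // directory holding part-00* files (illustrative)
        Path merged = new Path("output-merged/index.txt");   // illustrative destination
        // copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
        FileUtil.copyMerge(fs, jobOutput, fs, merged, false, conf, "");
    }
}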


-Original Message-
From: Runping Qi 
Sent: Wednesday, May 14, 2008 9:02 PM
To: core-user@hadoop.apache.org
Subject: RE: FileSystem.create 


My experience is to call Thread.sleep(100) after every N (say 1000) dfs
writes.

 -Original Message-
 From: Xavier Stevens
 Sent: Wednesday, May 14, 2008 10:47 AM
 To: core-user@hadoop.apache.org
 Subject: FileSystem.create
 
 I'm having some problems creating a new file on HDFS.  I am attempting
 to do this after my MapReduce job has finished and I am trying to
 combine all part-00* files into a single file programmatically.  It's
 throwing a LeaseExpiredException saying the file I just created doesn't
 exist.  Any idea why this is happening or what I can do to fix it?
 
 -Xavier
 
 Here is the code snippet


 ===
 FileSystem fileSys = FileSystem.get(job);
 FSDataOutputStream fsdos = fileSys.create(new Path(outputIndexPath));
 if (!fileSys.exists(new Path(outputIndexPath))) {
   System.err.println("File still does not exist: " + outputIndexPath);
 } else {
   System.out.println("File exists: " + outputIndexPath);
 }
 BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fsdos, "UTF-8"));
 
 Output with stack trace


 ===
 File exists: output/index.txt
 08/05/14 03:20:13 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.dfs.LeaseExpiredException: No lease on /user/xstevens/output/index.txt
 File does not exist. [Lease.  Holder: 44 46 53 43 6c 69 65 6e 74 5f 2d 31 30 31 34 35 38 35 32 32 33, heldlocks: 0, pendingcreates: 1]

 at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1160)
 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
 at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:312)
 at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
 at org.apache.hadoop.ipc.Client.call(Client.java:512)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
 at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2074)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1967)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1487)
 at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1601)
 
 
 




FileSystem.create

2008-05-14 Thread Xavier Stevens
I'm having some problems creating a new file on HDFS.  I am attempting
to do this after my MapReduce job has finished and I am trying to
combine all part-00* files into a single file programmatically.  It's
throwing a LeaseExpiredException saying the file I just created doesn't
exist.  Any idea why this is happening or what I can do to fix it?
 
-Xavier
 
Here is the code snippet

===
FileSystem fileSys = FileSystem.get(job);
FSDataOutputStream fsdos = fileSys.create(new Path(outputIndexPath));
if (!fileSys.exists(new Path(outputIndexPath))) {
  System.err.println("File still does not exist: " + outputIndexPath);
} else {
  System.out.println("File exists: " + outputIndexPath);
}
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fsdos, "UTF-8"));

Output with stack trace

===
File exists: output/index.txt
08/05/14 03:20:13 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.LeaseExpiredException: No lease on /user/xstevens/output/index.txt
File does not exist. [Lease.  Holder: 44 46 53 43 6c 69 65 6e 74 5f 2d 31 30 31 34 35 38 35 32 32 33, heldlocks: 0, pendingcreates: 1]

at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1160)
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:312)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
at org.apache.hadoop.ipc.Client.call(Client.java:512)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2074)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1967)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1487)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1601)
 
 
Xavier Stevens
Sr. Software Engineer
FOX INTERACTIVE MEDIA
e: [EMAIL PROTECTED]
p: 310.633.9749
 
 


Problem retrieving entry from compressed MapFile

2008-03-13 Thread Xavier Stevens
Currently I can retrieve entries if I use MapFileOutputFormat via 
conf.setOutputFormat with no compression specified.  But I was trying to do 
this:

public void configure(JobConf jobConf) {
    ...
    this.writer = new MapFile.Writer(jobConf, fileSys, dirName, Text.class,
        Text.class, SequenceFile.CompressionType.BLOCK);
    ...
}

public void map(WritableComparable key, Writable value,
        OutputCollector output, Reporter reporter) throws IOException {
    ...
    writer.append(newkey, newvalue);
    ...
}

This uses SequenceFile block compression.  Then, later, I try to retrieve the 
output values in a separate class:

public static void main(String[] args) throws Exception {
    ...
    conf.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);
    ...
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, inDataPath, defaults);
    Partitioner part = (Partitioner) ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
    Text entryValue = null;
    entryValue = (Text) MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), new Text());
    if (entryValue != null) {
        System.out.println("My Entry's Value: ");
        System.out.println(entryValue.toString());
    }
    for (MapFile.Reader reader : readers) {
        if (reader != null) {
            reader.close();
        }
    }
}

But when I use block compression I no longer get a result from 
MapFileOutputFormat.getEntry.  What am I doing wrong?  And/or is there a way 
for this to work using conf.setOutputFormat(MapFileOutputFormat.class) and 
conf.setMapOutputCompressionType(SequenceFile.CompressionType.BLOCK)? 


RE: What's the best way to get to a single key?

2008-03-10 Thread Xavier Stevens
I was thinking it would be easier to search a single index.  Unless I
don't have to worry about it and Hadoop searches all my indexes at the
same time.  Is that the case?

-Xavier
 

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:
 Thanks for everything so far.  It has been really helpful.  I have one

 more question.  Is there a way to merge MapFile index/data files?

No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile.

Why is a single file preferable?

Doug
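
A hedged sketch of the SequenceFile.Sorter merge linked above; the key/value classes and paths are illustrative, and, as noted, the result is a SequenceFile rather than a MapFile.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MergeSortedParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Sorted inputs, e.g. the data files inside several MapFile directories (illustrative paths).
        Path[] inputs = new Path[] {
                new Path("out/part-00000/data"),
                new Path("out/part-00001/data") };
        Path merged = new Path("out/merged-data");

        SequenceFile.Sorter sorter =
                new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
        // merge(Path[] inFiles, Path outFile, boolean deleteInput)
        sorter.merge(inputs, merged, false);
        // The merged result is a single SequenceFile, not a MapFile.
    }
}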




RE: What's the best way to get to a single key?

2008-03-10 Thread Xavier Stevens
So I read some more through the Javadocs.  I had 11 reducers on my original job, 
leaving me with 11 MapFile directories.  I am passing in their parent directory 
here as outDir.

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, outDir, defaults);
Partitioner part = (Partitioner) ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
Text entryValue = (Text) MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), null);
System.out.println("My Entry's Value: ");
System.out.println(entryValue.toString());

But I am getting an exception:

Exception in thread "main" java.lang.ArithmeticException: / by zero
at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:35)
at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:85)
at mypackage.MyClass.main(ProfileReader.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

I am assuming I am doing something wrong, but I'm not sure what it is yet.  Any 
ideas?


-Xavier


-Original Message-
From: Xavier Stevens
Sent: Mon 3/10/2008 3:49 PM
To: core-user@hadoop.apache.org
Subject: RE: What's the best way to get to a single key?
 
I was thinking it would be easier to search a single index.  Unless I
don't have to worry about it and Hadoop searches all my indexes at the
same time.  Is that the case?

-Xavier
 

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:
 Thanks for everything so far.  It has been really helpful.  I have one

 more question.  Is there a way to merge MapFile index/data files?

No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile.

Why is a single file preferable?

Doug




 


RE: What's the best way to get to a single key?

2008-03-06 Thread Xavier Stevens
Thanks for everything so far.  It has been really helpful.  I have one
more question.  Is there a way to merge MapFile index/data files?
Assuming there is, what is the best way to do so?  I was reading the
Java docs on it and it looked like this is possible but it wasn't very
explicit.  Obviously I could specify to use a single reducer, but with
my data size that would be really slow.

Thanks,

-Xavier


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 04, 2008 12:53 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:
 Is there a way to do this when your input data is using SequenceFile 
 compression?

Yes.  A MapFile is simply a directory containing two SequenceFiles named
"data" and "index".  MapFileOutputFormat uses the same compression
parameters as SequenceFileOutputFormat.  SequenceFileInputFormat
recognizes MapFiles and reads the data file.  So you should be able to
just switch from specifying SequenceFileOutputFormat to
MapFileOutputFormat in your jobs and everything should work the same
except you'll have index files that permit random access.

Doug
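
A minimal sketch of the switch described above, assuming the old mapred API; the key and value classes are illustrative, and compression is configured the same way as for SequenceFileOutputFormat.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class MapFileJobSetup {
    // Hedged sketch: only the output format changes; everything else stays as before.
    static void useMapFileOutput(JobConf conf) {
        conf.setOutputFormat(MapFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        SequenceFileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf,
                SequenceFile.CompressionType.BLOCK);
    }
}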




What's the best way to get to a single key?

2008-03-03 Thread Xavier Stevens
I am curious how others might be solving this problem.  I want to
retrieve a record from HDFS based on its key.  Are there any methods
that can shortcut this type of search to avoid parsing all data until
you find it?  Obviously HBase would do this as well, but I wanted to
know if there is a way to do it using just Map/Reduce and HDFS.

Thanks,

-Xavier