Re: stable version

2009-02-11 Thread Rasit OZDAS
Yes, version 18.3 is the most stable one. It has additional patches,
without unproven new functionality.

2009/2/11 Owen O'Malley omal...@apache.org:

 On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote:

 Maybe version 0.18
 is better suited for production environment?

 Yahoo is mostly on 0.18.3 + some patches at this point.

 -- Owen




-- 
M. Raşit ÖZDAŞ


Re: Reporter for Hadoop Streaming?

2009-02-11 Thread Tom White
You can retrieve them from the command line using

bin/hadoop job -counter <job-id> <group-name> <counter-name>

Tom
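
For the programmatic question quoted below, here is a minimal sketch against the old mapred API of that era (JobClient/RunningJob); the job id argument and the counter group/name are placeholders, not values from the thread:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ReadCounters {
  public static void main(String[] args) throws Exception {
    // args[0] is the job id, e.g. job_200902110000_0001
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(args[0]);
    Counters counters = job.getCounters();
    long value = counters.getGroup("my-group").getCounter("my-counter");
    System.out.println("my-counter = " + value);
  }
}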

On Wed, Feb 11, 2009 at 12:20 AM, scruffy323 steve.mo...@gmail.com wrote:

 Do you know how to access those counters programmatically after the job has
 run?


 S D-5 wrote:

 This does it. Thanks!

 On Thu, Feb 5, 2009 at 9:14 PM, Arun C Murthy a...@yahoo-inc.com wrote:


 On Feb 5, 2009, at 1:40 PM, S D wrote:

  Is there a way to use the Reporter interface (or something similar such
 as
 Counters) with Hadoop streaming? Alternatively, how could STDOUT be
 intercepted for the purpose of updates? If anyone could point me to
 documentation or examples that cover this I'd appreciate it.



 http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+counters+in+streaming+applications%3F

 http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+status+in+streaming+applications%3F

 Arun




 --
 View this message in context: 
 http://www.nabble.com/Reporter-for-Hadoop-Streaming--tp21861786p21945843.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: anybody knows an apache-license-compatible impl of Integer.parseInt?

2009-02-11 Thread Steve Loughran

Zheng Shao wrote:

We need to implement a version of Integer.parseInt/atoi that works on byte[] instead of
String, to avoid the high cost of creating a String object.

I wanted to take the OpenJDK code, but the license is GPL:
http://www.docjar.com/html/api/java/lang/Integer.java.html

Does anybody know of an implementation that I can use for Hive (Apache license)?


I also need to do it for Byte, Short, Long, and Double. I just don't want to go
over all the corner cases myself.


Use the Apache Harmony code
http://svn.apache.org/viewvc/harmony/enhanced/classlib/branches/java6/modules/
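
For illustration only, a rough sketch (written for this digest, not taken from Harmony or the JDK) of the byte[]-based parse Zheng describes; overflow and radix handling are deliberately omitted:

public final class ByteArrayParse {
  // Parses a signed decimal int from bytes[start, start+length) without
  // allocating a String. Sketch only: no overflow check.
  public static int parseInt(byte[] bytes, int start, int length) {
    if (length <= 0) {
      throw new NumberFormatException("empty input");
    }
    int i = start;
    int end = start + length;
    boolean negative = bytes[i] == '-';
    if (negative || bytes[i] == '+') {
      i++;
    }
    if (i == end) {
      throw new NumberFormatException("sign with no digits");
    }
    int result = 0;
    for (; i < end; i++) {
      int digit = bytes[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException("bad digit at offset " + i);
      }
      result = result * 10 + digit;
    }
    return negative ? -result : result;
  }
}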


Re: File Transfer Rates

2009-02-11 Thread Steve Loughran

Brian Bockelman wrote:
Just to toss out some numbers (and because our users are making 
interesting numbers right now)


Here's our external network router: 
http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets 



Here's the application-level transfer graph: 
http://t2.unl.edu/phedex/graphs/quantity_rates?link=srcno_mss=trueto_node=Nebraska 



In a squeeze, we can move 20-50TB / day to/from other heterogeneous 
sites.  Usually, we run out of free space before we can find the upper 
limit for a 24-hour period.


We use a protocol called GridFTP to move data back and forth between 
external (non-HDFS) clusters.  The other sites we transfer with use 
niche software you probably haven't heard of (Castor, DPM, and dCache) 
because, well, it's niche software.  I have no available data on 
HDFS-S3 systems, but I'd again claim it's mostly a function of the 
amount of hardware you throw at it and the size of your network pipes.


There are currently 182 datanodes; 180 are traditional ones of 3TB each, 
and 2 are big honking RAID arrays of 40TB.  Transfers are load-balanced 
amongst ~7 GridFTP servers, each of which has a 1Gbps connection.




GridFTP is optimised for high-bandwidth network connections, with 
negotiated packet sizes and multiple TCP connections, so when Nagle's 
algorithm triggers backoff after a dropped packet, only a fraction of the 
transmission is affected. It is probably best-in-class for long-haul 
transfers over the big university backbones where someone else pays for 
your traffic. You would be very hard pressed to get even close to that 
on any other protocol.


I have no data on S3 xfers other than hearsay:
 * write time to S3 can be slow, as it doesn't return until the data is 
persisted somewhere. That's a better guarantee than a POSIX write 
operation.
 * you have to rely on other people on your rack not wanting all the 
traffic for themselves. That's an EC2 API issue: you don't get to 
request/buy bandwidth to/from S3


One thing to remember is that if you bring up a Hadoop cluster on any 
virtual server farm, disk IO is going to be way below physical IO rates. 
Even when the data is in HDFS, it will be slower to get at than 
dedicated high-RPM SCSI or SATA storage.


Hadoop setup questions

2009-02-11 Thread bjday

Good morning everyone,

I have a question about the correct setup for Hadoop.  I have 14 Dell 
computers in a lab.  Each is connected to the internet and each is 
independent of the others.  All run CentOS.  Logins are handled by NIS.  
If UserA logs into the master and starts the daemons, and UserB logs into 
the master and wants to run a job while the daemons from UserA are still 
running, the following error occurs:


copyFromLocal: org.apache.hadoop.security.AccessControlException: 
Permission denied: user=UserB, access=WRITE, 
inode=user:UserA:supergroup:rwxr-xr-x


What needs to be changed to allow UserB-UserZ to run their jobs?  Does 
there need to be a local user that everyone logs into and runs jobs from?  
Should Hadoop be run on an actual cluster instead of independent 
computers?  Any idea what the correct configuration settings are to 
allow this?


I followed Ravi Phulari's suggestions and these guides:

http://hadoop.apache.org/core/docs/current/quickstart.html
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)


These allowed me to get Hadoop running on the 14 computers when I log in, 
and everything works fine, thank you Ravi.  The problem occurs when 
additional people attempt to run jobs simultaneously.


Thank you,

Brian



Re: stable version

2009-02-11 Thread Vadim Zaliva
The particular problem I am having is this one:

https://issues.apache.org/jira/browse/HADOOP-2669

I am observing it in version 19. Could anybody confirm that
it has been fixed in 18, as Jira claims?

I am wondering why the bug fix for this problem might have been committed
to the 18 branch but not to 19. If it was committed to both, then perhaps the
problem was not completely solved and downgrading to 18 will not help
me.

Vadim

On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote:
 Yes, version 18.3 is the most stable one. It has additional patches,
 without unproven new functionality.

 2009/2/11 Owen O'Malley omal...@apache.org:

 On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote:

 Maybe version 0.18
 is better suited for production environment?

 Yahoo is mostly on 0.18.3 + some patches at this point.

 -- Owen




 --
 M. Raşit ÖZDAŞ



Finding small subset in very large dataset

2009-02-11 Thread Thibaut_

Hi,

Let's say the smaller subset has name A. It is a relatively small collection,
< 100 000 entries (could also be only 100), with nearly no payload as value. 
Collection B is a big collection with 10 000 000 entries (each key of A
also exists in collection B), where the value for each key is relatively
big (> 100 KB).

For all the keys in A, I need to get the corresponding value from B and
collect it in the output.


- I could do this by reading in both files and, in the reduce step, doing my
computations and collecting only those keys which are in both A and B. The map
phase, however, will take very long, as all the key/value pairs of collection B
need to be sorted (and each key's value is 100 KB) at the end of the map phase,
which is overkill if A is very small.

What I would need is an option to somehow compute the intersection first
(map only on the keys, then a reduce function based only on the keys, not the
corresponding values, which collects the keys I want to take), and then
run over the map input again, filtering the output collector or the input
based on the results from that reduce phase.

Or is there another, faster way? Collection A could be so big that it doesn't
fit into memory. I could split collection A up into multiple smaller
collections, but that would make it more complicated, so I want to avoid
that route. (This is similar to the approach I described above, just a
manual one.)

Thanks,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: stable version

2009-02-11 Thread Raghu Angadi

Vadim Zaliva wrote:

The particular problem I am having is this one:

https://issues.apache.org/jira/browse/HADOOP-2669

I am observing it in version 19. Could anybody confirm that
it has been fixed in 18, as Jira claims?

I am wondering why the bug fix for this problem might have been committed
to the 18 branch but not to 19. If it was committed to both, then perhaps the
problem was not completely solved and downgrading to 18 will not help
me.


If you read through the comments, you will see that the root cause 
was never found. The patch just fixes one of the suspects. If you are 
still seeing this, please file another jira and link it to HADOOP-2669.


How easy is it for you to reproduce this? I guess one of the reasons for 
the incomplete diagnosis is that it is not simple to reproduce.


Raghu.


Vadim

On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote:

Yes, version 18.3 is the most stable one. It has additional patches,
without unproven new functionality.

2009/2/11 Owen O'Malley omal...@apache.org:

On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote:


Maybe version 0.18
is better suited for production environment?

Yahoo is mostly on 0.18.3 + some patches at this point.

-- Owen




--
M. Raşit ÖZDAŞ





Re: Finding small subset in very large dataset

2009-02-11 Thread Amit Chandel
Are the keys in collection B unique?

If so, I would like to try this approach:
For each key/value of collection B, make a file out of it, with the file name
given by the MD5 hash of the key and the value as its content, and then
store all these files into a HAR archive.
The HAR archive will create an index for you over the keys.
Now you can iterate over collection A, get the MD5 hash of each key, and
look it up in the archive to retrieve the file (and hence the value).
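
A small illustrative helper (an assumption, not something from the thread) for deriving that MD5-based file name:

import java.security.MessageDigest;

public final class Md5Name {
  // Returns the lowercase hex MD5 of the key, e.g. to use as the per-key
  // file name inside the HAR archive.
  public static String md5Hex(String key) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder(32);
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}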

On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ tbr...@blue.lu wrote:


 Hi,

 Let's say the smaller subset has name A. It is a relatively small
 collection
 < 100 000 entries (could also be only 100), with nearly no payload as
 value.
 Collection B is a big collection with 10 000 000 entries (Each key of A
 also exists in the collection B), where the value for each key is
 relatively
 big (> 100 KB)

 For all the keys in A, I need to get the corresponding value from B and
 collect it in the output.


 - I can do this by reading in both files, and on the reduce step, do my
 computations and collect only those which are both in A and B. The map
 phase
 however will take very long as all the key/value pairs of collection B need
 to be sorted (and each key's value is 100 KB) at the end of the map phase,
 which is overkill if A is very small.

 What I would need is an option to somehow make the intersection first
 (Mapper only on keys, then a reduce function based only on keys and not the
 corresponding values which collects the keys I want to take), and then
 running the map input and filtering the output collector or the input based
 on the results from the reduce phase.

 Or is there another faster way? Collection A could be so big that it
 doesn't
 fit into the memory. I could split collection A up into multiple smaller
 collections, but that would make it more complicated, so I want to evade
 that route. (This is similar to the approach I described above, just a
 manual approach)

 Thanks,
 Thibaut
 --
 View this message in context:
 http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




can't edit the file that mounted by fuse_dfs by editor

2009-02-11 Thread zhuweimin
Hey all

I was trying to edit a file mounted via fuse_dfs with the vi editor, but the
contents could not be saved.
The command is like the following:
[had...@vm-centos-5-shu-4 src]$ vi /mnt/dfs/test.txt
The error message from the system log (/var/log/messages) is the following:
Feb 12 09:53:48 VM-CentOS-5-SHU-4 fuse_dfs: ERROR: could not connect open
file fuse_dfs.c:1340

I am using hadoop 0.19.0 and fuse-dfs version 26 with CentOS 5.2.
Does anyone have an idea as to what could be wrong?

Thanks!
zhuweimin




Re: Finding small subset in very large dataset

2009-02-11 Thread Aaron Kimball
I don't see why a HAR archive needs to be involved. You can use a MapFile to
create a scannable index over a SequenceFile and do lookups that way.

But if A is small enough to fit in RAM, then there is a much simpler way:
write it out to a file and disseminate it to all mappers via the
DistributedCache. They then each read the entire A set into a HashSet or
other data structure during configure(), before they scan through their
slices of B. They then emit only the B values which hit in A. This is called
a map-side join. If you don't care about sorted ordering of your results,
you can then disable the reducers entirely.
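
A minimal sketch of such a map-side join mapper under the old mapred API; the assumption (not stated in the thread) is that the cached A file holds one key per line and that B is read as Text key/value pairs:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSideJoinMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private final Set<String> aKeys = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      // The A file is assumed to have been shipped with
      // DistributedCache.addCacheFile() by the job driver.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        aKeys.add(line.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load the A key set", e);
    }
  }

  public void map(Text key, Text value, OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    if (aKeys.contains(key.toString())) {   // emit only B values that hit in A
      output.collect(key, value);
    }
  }
}

The driver would then call something like DistributedCache.addCacheFile(aFileUri, conf) and, if sorted output is not needed, conf.setNumReduceTasks(0).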

Hive already supports this behavior; but you have to explicitly tell it to
enable map-side joins for each query because only you know that one data set
is small enough ahead of time.

If your A set doesn't fit in RAM, you'll need to get more creative. One
possibility is to do the same thing as above, but instead of reading all of
A into memory, use a hash function to squash the keys from A into some
bounded amount of RAM. For example, allocate yourself a 256 MB bitvector;
for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for
each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1,
then it may match an A key; if it's 0 then it definitely does not match an A
key. For each potential match, send it to the reducer. Send all the A keys
to the reducer as well, where the  precise joining will occur. (Note: this
is effectively the same thing as a Bloom Filter.)
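
An illustrative sketch of that bounded bitvector; the single hash function and the sizing are assumptions, and a production job would more likely use a proper Bloom filter with several hashes:

import java.util.BitSet;

public class KeyBitVector {
  private final BitSet bits;
  private final int size;

  public KeyBitVector(int sizeInBits) {
    this.size = sizeInBits;
    this.bits = new BitSet(sizeInBits);
  }

  private int slot(String key) {
    return (key.hashCode() & Integer.MAX_VALUE) % size;  // non-negative slot
  }

  public void add(String aKey) {           // call once for every key in A
    bits.set(slot(aKey));
  }

  public boolean mayMatch(String bKey) {   // false means definitely not in A
    return bits.get(slot(bKey));
  }
}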

This will send much less data to each reducer and should see better
throughput.

- Aaron


On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel amitchan...@gmail.com wrote:

 Are the keys in collection B unique?

 If so, I would like to try this approach:
 For each key, value of collection B, make a file out of it with file name
 given by MD5 hash of the key, and value being its content, and then
 store all these files into a HAR archive.
 The HAR archive will create an index for you over the keys.
 Now you can iterate over the collection A, get the MD5 hash of the key, and
 look up in the archive for the file (to get the value).

 On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ tbr...@blue.lu wrote:

 
  Hi,
 
  Let's say the smaller subset has name A. It is a relatively small
  collection
  < 100 000 entries (could also be only 100), with nearly no payload as
  value.
  Collection B is a big collection with 10 000 000 entries (Each key of A
  also exists in the collection B), where the value for each key is
  relatively
  big (> 100 KB)
 
  For all the keys in A, I need to get the corresponding value from B and
  collect it in the output.
 
 
  - I can do this by reading in both files, and on the reduce step, do my
  computations and collect only those which are both in A and B. The map
  phase
  however will take very long as all the key/value pairs of collection B
 need
  to be sorted (and each key's value is 100 KB) at the end of the map
 phase,
  which is overkill if A is very small.
 
  What I would need is an option to somehow make the intersection first
  (Mapper only on keys, then a reduce function based only on keys and not
 the
  corresponding values which collects the keys I want to take), and then
  running the map input and filtering the output collector or the input
 based
  on the results from the reduce phase.
 
  Or is there another faster way? Collection A could be so big that it
  doesn't
  fit into the memory. I could split collection A up into multiple smaller
  collections, but that would make it more complicated, so I want to evade
  that route. (This is similar to the approach I described above, just a
  manual approach)
 
  Thanks,
  Thibaut
  --
  View this message in context:
 
 http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Reducer Out of Memory

2009-02-11 Thread Kris Jirapinyo
Hi all,
I am running a data-intensive job on 18 nodes on EC2, each with just
1.7GB of memory.  The input size is 50GB, and as a result, my mapper splits
it up automatically into 786 map tasks.  This runs fine.  However, I am
setting the reduce task number to 18.  This is where I get a Java heap
out-of-memory error:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)


Re: Reducer Out of Memory

2009-02-11 Thread Rocks Lei Wang
Maybe you need to allocate a larger JVM heap for the child tasks, using the parameter -Xmx1024m.
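
For reference, that flag is usually handed to the map/reduce child JVMs through mapred.child.java.opts; a minimal sketch (the class name and values are placeholders, not from the thread):

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigSketch {
  public static JobConf configure() {
    JobConf conf = new JobConf(HeapConfigSketch.class);
    // Heap given to each map/reduce child JVM; it must fit within the node's
    // RAM alongside the other task slots running on that node.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    return conf;
  }
}

The same property can also be set cluster-wide in conf/hadoop-site.xml.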

On Thu, Feb 12, 2009 at 10:56 AM, Kris Jirapinyo kjirapi...@biz360.comwrote:

 Hi all,
I am running a data-intensive job on 18 nodes on EC2, each with just
 1.7GB of memory.  The input size is 50GB, and as a result, my mapper splits
 it up automatically to 786 map tasks.  This runs fine.  However, I am
 setting the reduce task number to 18.  This is where I get a java heap out
 of memory error:

 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)



Re: Reducer Out of Memory

2009-02-11 Thread Kris Jirapinyo
Darn that send button.

Anyways, so I was wondering if my understanding is correct.  There will only
be the exact same number of output files as the number of reducer tasks I
set.  Thus, in my output directory from the reducer, I should always see
only 18 files.  However, if my understanding is correct, then when I call
the output.collect() in my reducer, does it only get flushed at the end
when that particular reducer task finishes?  If that is the case, then it
does seem like as my input grows, 18 reducers will not be able to handle the
sheer volume of my data, as the collector will keep having to add more and
more data to it.

Thus, I guess this is the question.  Do I have to keep increasing the number
of reduce tasks so that each reducer can take smaller bites out of the
chunk?  That is, if I'm running out of Java heap space and I don't want to add
more nodes, do I need to set my reduce task number to, say, 36, etc.?  It
just seems like I'm missing something.

Of course, I could always add more nodes or upgrade to a higher instance so
I get more memory, but that's the obvious solution (I just hope it's not the
only solution).  I guess what I'm saying is that I thought the reducer would
be kind of smart enough to know that it's taking too big of a bite out of
the whole chunk (like the mapper) and readjust itself, as I don't really
care how many output files I get in the end, just that the result from the
reducer stays under one directory.


On Wed, Feb 11, 2009 at 6:56 PM, Kris Jirapinyo kjirapi...@biz360.comwrote:

 Hi all,
 I am running a data-intensive job on 18 nodes on EC2, each with just
 1.7GB of memory.  The input size is 50GB, and as a result, my mapper splits
 it up automatically to 786 map tasks.  This runs fine.  However, I am
 setting the reduce task number to 18.  This is where I get a java heap out
 of memory error:

 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOfRange(Arrays.java:3209)
   at java.lang.String.<init>(String.java:216)
   at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
   at java.nio.CharBuffer.toString(CharBuffer.java:1157)

   at org.apache.hadoop.io.Text.decode(Text.java:350)
   at org.apache.hadoop.io.Text.decode(Text.java:327)
   at org.apache.hadoop.io.Text.toString(Text.java:254)

   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)

   at org.apache.hadoop.mapred.Child.main(Child.java:155)





Re: Reducer Out of Memory

2009-02-11 Thread Kris Jirapinyo
I tried that, but with 1.7GB of memory that will not allow me to run 1 mapper
and 1 reducer concurrently (as I think -Xmx1024m tries to reserve that much
physical memory?).  Thus, to be safe, I set it to -Xmx768m.

The error I get when I do 1024m is this:

java.io.IOException: Cannot run program bash: java.io.IOException:
error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2079)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot
allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 10 more




On Wed, Feb 11, 2009 at 7:02 PM, Rocks Lei Wang beyiw...@gmail.com wrote:

 Maybe you need to allocate a larger JVM heap for the child tasks, using the parameter -Xmx1024m.

 On Thu, Feb 12, 2009 at 10:56 AM, Kris Jirapinyo kjirapi...@biz360.com
 wrote:

  Hi all,
 I am running a data-intensive job on 18 nodes on EC2, each with just
  1.7GB of memory.  The input size is 50GB, and as a result, my mapper
 splits
  it up automatically to 786 map tasks.  This runs fine.  However, I am
  setting the reduce task number to 18.  This is where I get a java heap
 out
  of memory error:
 
  java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:3209)
 at java.lang.String.<init>(String.java:216)
 at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
 at java.nio.CharBuffer.toString(CharBuffer.java:1157)
 at org.apache.hadoop.io.Text.decode(Text.java:350)
 at org.apache.hadoop.io.Text.decode(Text.java:327)
 at org.apache.hadoop.io.Text.toString(Text.java:254)
 
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)
 



Re: Hadoop setup questions

2009-02-11 Thread Amar Kamat

bjday wrote:

Good morning everyone,

I have a question about correct setup for hadoop.  I have 14 Dell 
computers in a lab.   Each connected to the internet and each 
independent of each other.  All run CentOS.  Logins are handled by 
NIS.  If userA logs into the master and starts the daemons and UserB 
logs into the master and wants to run a job while the daemons from 
UserA are still running the following error occurs:


copyFromLocal: org.apache.hadoop.security.AccessControlException: 
Permission denied: user=UserB, access=WRITE, 
inode=user:UserA:supergroup:rwxr-xr-x
Looks like one of your files (input or output) belongs to a different user. 
It seems like your DFS has permissions enabled. If you don't require 
permissions, disable them; otherwise make sure that the input/output paths 
are under your own user (/user/userB is the home directory for userB).

Amar


what needs to be changed to allow UserB-UserZ to run their jobs?  Does 
there need to be a local user the everyone logs into as and run from 
there?  Should Hadoop be ran in an actual cluster instead of 
independent computers?  Any ideas what is the correct configuration 
settings that allow it?


I followed Ravi Phulari suggestions and followed:

http://hadoop.apache.org/core/docs/current/quickstart.html
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)



These allowed me to get Hadoop running on the 14 computers when I 
login and everything works fine, thank you Ravi.  The problem occurs 
when additional people attempt to run jobs simultaneously.


Thank you,

Brian





Re: Hadoop setup questions

2009-02-11 Thread james warren
Like Amar said.  Try adding

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>


to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml),
restart your daemons and give it a whirl.

cheers,
-jw

On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat ama...@yahoo-inc.com wrote:

 bjday wrote:

 Good morning everyone,

 I have a question about correct setup for hadoop.  I have 14 Dell
 computers in a lab.   Each connected to the internet and each independent of
 each other.  All run CentOS.  Logins are handled by NIS.  If userA logs into
 the master and starts the daemons and UserB logs into the master and wants
 to run a job while the daemons from UserA are still running the following
 error occurs:

 copyFromLocal: org.apache.hadoop.security.AccessControlException:
 Permission denied: user=UserB, access=WRITE,
 inode=user:UserA:supergroup:rwxr-xr-x

 Looks like one of your files (input or output) belongs to a different user. It seems
 like your DFS has permissions enabled. If you don't require permissions, then
 disable them; otherwise make sure that the input/output paths are under your
 own user (/user/userB is the home directory for userB).
 Amar


 what needs to be changed to allow UserB-UserZ to run their jobs?  Does
 there need to be a local user the everyone logs into as and run from there?
  Should Hadoop be ran in an actual cluster instead of independent computers?
  Any ideas what is the correct configuration settings that allow it?

 I followed Ravi Phulari suggestions and followed:

 http://hadoop.apache.org/core/docs/current/quickstart.html

 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)


 These allowed me to get Hadoop running on the 14 computers when I login
 and everything works fine, thank you Ravi.  The problem occurs when
 additional people attempt to run jobs simultaneously.

 Thank you,

 Brian





Re: Loading native libraries

2009-02-11 Thread Rasit OZDAS
I also have the same problem.
It would be wonderful if someone had some info about this.

Rasit

2009/2/10 Mimi Sun m...@rapleaf.com:
 I see UnsatisfiedLinkError.  Also I'm calling
 System.getProperty("java.library.path") in the reducer and logging it. The
 only thing that prints out is
 ...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32
 I'm using Cascading, not sure if that affects anything.

 - Mimi

 On Feb 10, 2009, at 11:40 AM, Arun C Murthy wrote:


 On Feb 10, 2009, at 11:06 AM, Mimi Sun wrote:

 Hi,

 I'm new to Hadoop and I'm wondering what the recommended method is for
 using native libraries in mapred jobs.
 I've tried the following separately:
 1. set LD_LIBRARY_PATH in .bashrc
 2. set LD_LIBRARY_PATH and  JAVA_LIBRARY_PATH in hadoop-env.sh
 3. set -Djava.library.path=... for mapred.child.java.opts

 For what you are trying (i.e. given that the JNI libs are present on all
 machines at a constant path) setting -Djava.library.path for the child task
 via mapred.child.java.opts should work. What are you seeing?

 Arun


 4. change bin/hadoop to include  $LD_LIBRARY_PATH in addition to the path
 it generates:  HADOOP_OPTS=$HADOOP_OPTS
 -Djava.library.path=$LD_LIBRARY_PATH:$JAVA_LIBRARY_PATH
 5. drop the .so files I need into hadoop/lib/native/...

 1~3 didn't work, 4 and 5 did but seem to be hacks. I also read that I can
 do this using DistributedCache, but that seems to be extra work for loading
 libraries that are already present on each machine. (I'm using the JNI libs
 for berkeley db).
 It seems that there should be a way to configure java.library.path for
 the mapred jobs.  Perhaps bin/hadoop should make use of LD_LIBRARY_PATH?

 Thanks,
 - Mimi






-- 
M. Raşit ÖZDAŞ


Re: Loading native libraries

2009-02-11 Thread Arun C Murthy


On Feb 10, 2009, at 12:24 PM, Mimi Sun wrote:

I see UnsatisfiedLinkError.  Also I'm calling
System.getProperty("java.library.path") in the reducer and logging
it. The only thing that prints out is
...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32

I'm using Cascading, not sure if that affects anything.



Hmm... that's odd. The framework does try to pass the user-provided
java.library.path down to the launched JVM. I assume your
mapred.child.java.opts looks something like

-Xmx<heapsize> -Djava.library.path=<path> ?
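
A minimal sketch of the kind of setting Arun refers to; the heap size and library path here are placeholders, not values from the thread:

import org.apache.hadoop.mapred.JobConf;

public class NativeLibOptsSketch {
  public static JobConf configure() {
    JobConf conf = new JobConf(NativeLibOptsSketch.class);
    // Both the heap and the JNI library directory are illustrative values.
    conf.set("mapred.child.java.opts",
             "-Xmx512m -Djava.library.path=/usr/local/lib/bdb-jni");
    return conf;
  }
}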

Arun



Re: what's going on :( ?

2009-02-11 Thread Rasit OZDAS
Hi Mark,

Try adding an extra property to that file and check whether Hadoop recognizes it.
That way you can find out whether Hadoop is actually using your configuration file.

2009/2/10 Jeff Hammerbacher ham...@cloudera.com:
 Hey Mark,

 In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020.
 From my understanding of the code, your fs.default.name setting should
 have overridden this port to be 9000. It appears your Hadoop
 installation has not picked up the configuration settings
 appropriately. You might want to see if you have any Hadoop processes
 running and terminate them (bin/stop-all.sh should help) and then
 restart your cluster with the new configuration to see if that helps.

 Later,
 Jeff

 On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote:
 Mark Kerzner wrote:

 Hi,

 why is hadoop suddenly telling me

  Retrying connect to server: localhost/127.0.0.1:8020

 with this configuration

 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>localhost:9001</value>


 Shouldn't this be

 <value>hdfs://localhost:9001</value>

 Amar

   </property>
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
 </configuration>

 and both the http://localhost:50070/dfshealth.jsp and
 http://localhost:50030/jobtracker.jsp links work fine?

 Thank you,
 Mark








-- 
M. Raşit ÖZDAŞ