Couple of basic hdfs starter issues

2008-06-07 Thread chris collins
Sorry in advance if these "challenges" are covered in a document somewhere.

I have set up Hadoop on a CentOS 64-bit Linux box.  I have verified that it is 
up and running only by seeing the Java processes running and by being able to 
access it from the admin UI.

The Hadoop version is 0.17.0, but I also tried 0.16.4 for the following issue:

From a Mac OS X box using Java 1.5, I am trying to run the following:

String home = "hdfs://linuxbox:9000";
URI uri = new URI(home);
Configuration conf = new Configuration();

FileSystem fs = FileSystem.get(uri, conf);

The call to FileSystem.get throws an IOException stating that there is a login 
error with message "whoami".

When I single-step through the code, there is an attempt to figure out what user 
is running this process by creating a ProcessBuilder with "whoami".  This fails 
with a "not found" error.  I believe this is because you have to give 
ProcessBuilder a fully qualified path on the Mac?
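For anyone who wants to see the failure mode in isolation, here is a minimal sketch 
(not the Hadoop code itself) of why a bare "whoami" depends on the PATH the JVM 
inherited, which an IDE-launched process on the Mac may not have:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class WhoamiCheck {
    public static void main(String[] args) throws Exception {
        // The HDFS client resolves "whoami" through this JVM's PATH.  If the
        // IDE launched the JVM without /usr/bin on PATH, the PATH-dependent
        // form below throws IOException ("Cannot run program \"whoami\"").
        System.out.println("PATH = " + System.getenv("PATH"));
        Process p = new ProcessBuilder("whoami").start();
        // new ProcessBuilder("/usr/bin/whoami") works regardless of PATH
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        System.out.println("whoami -> " + r.readLine());
    }
}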

I also verified that my hadoop-default.xml and hadoop-site.xml are in fact found 
on the classpath.

All this is being attempted via a debug session in the IntelliJ IDE.

Any ideas on what I am doing wrong?  I am sure it's a configuration blunder on my 
part.

Further, we used to use an old copy of Nutch; of course the Hadoop part of 
Nutch is now its own jar file, so I upgraded the Nutch jars too.  We were using a 
few things within the Nutch project that seem to have gone away:

net.sf incarnation of the Snowball stemmer (I fixed this by pulling the source 
directly from the author).
language identification: any idea where it went?
carrot2 clustering: any idea where that went?

Thanks in advance.

Chris


RE: Couple of basic hdfs starter issues

2008-06-07 Thread chris collins
I should update this as stupidity on my part (though the hidden shell execution 
within the client, whose error gets masked, is somewhat fickle).  Of course, if I 
start the thing up not via the IDE but from the command line, it gets past 
this problem (then a security issue, but that one is probably a more obvious thing).

Still, if anyone has an idea what happened to the language identification and 
Carrot2 stuff inside Nutch, that would be appreciated.

C





Re: client connect as different username?

2008-06-11 Thread Chris Collins
The finer point to this is that in development you may be logged in as 
user x and have a shared HDFS instance that a number of people are 
using.  In that mode it's not practical to sudo, since you have all your 
development tools set up for user x.  HDFS is set up with a single user; 
what is the procedure to add users to that HDFS instance?  It has to 
support it, surely?  It's really not obvious: looking in the HDFS docs 
that come with the distro, nothing springs out.  The hadoop command-line 
tool doesn't have anything that vaguely looks like a way to create 
a user.


Help is greatly appreciated.  I am sure it's somewhere blindingly 
obvious.


How are other people doing this, other than sudoing to one single user name?

Thanks

ChRiS


On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:

The best way is to use the sudo command to execute the Hadoop client.  Does 
that work for you?


Nicholas


- Original Message 

From: Bob Remeika <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 11, 2008 12:56:14 PM
Subject: client connect as different username?

Apologies if this is an RTM response, but I looked and wasn't able to find 
anything concrete.  Is it possible to connect to HDFS via the HDFS client 
under a different username than the one I am currently logged in as?

Here is our situation: I am user bobr on the client machine.  I need to add 
something to the HDFS cluster as the user "companyuser".  Is this possible 
with the current set of APIs, or do I have to upload and "chown"?

Thanks,
Bob






Re: client connect as different username?

2008-06-11 Thread Chris Collins
We know whoami is called, thanks; I found out painfully the first day 
I played with this, because in dev my IDE is not started from a 
shell.  Therefore the PATH is not inherited to include /usr/bin.  The HDFS 
client hides the fact that ProcessBuilder barfs with a file-not-found, 
surfacing it as a "login exception", "whoami".  Not as clear as I would have 
liked :-}|


You are referring to creating a directory in HDFS?  Because if I am 
user chris and HDFS only has user foo, then I can't create a 
directory because I don't have perms; in fact I can't even connect.  I 
believe another emailer holds the answer, which was blindingly dumb on my 
part for not trying: adding a user in unix and creating a 
group that those users belong to.
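A rough sketch of that approach (the group and directory names are made up; note 
that HDFS simply trusts the user and group names reported by the client's 
whoami/groups, so the group has to exist on the machines the clients run from, 
not inside HDFS itself):

# on the client machines (Linux syntax; adjust for OS X)
sudo groupadd devteam
sudo usermod -a -G devteam chris

# then, as the HDFS superuser, open a shared tree to that group
hadoop dfs -mkdir /shared
hadoop dfs -chgrp devteam /shared
hadoop dfs -chmod 775 /shared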


Thanks

Chris
On Jun 11, 2008, at 5:36 PM, Allen Wittenauer wrote:





On 6/11/08 5:17 PM, "Chris Collins" <[EMAIL PROTECTED]> wrote:

The finer point to this is that in development you may be logged in as 
user x and have a shared HDFS instance that a number of people are 
using.  In that mode it's not practical to sudo, since you have all your 
development tools set up for user x.  HDFS is set up with a single user; 
what is the procedure to add users to that HDFS instance?  It has to 
support it, surely?  It's really not obvious: looking in the HDFS docs 
that come with the distro, nothing springs out.  The hadoop command-line 
tool doesn't have anything that vaguely looks like a way to create 
a user.


   User information is sent from the client.  The code literally does a 
'whoami' and 'groups' and sends that information to the server.

   Shared data should be handled just like you would in UNIX:

   - create a directory
   - set permissions to be insecure
   - go crazy
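In command form, that recipe is roughly (the path name is just an example):

hadoop dfs -mkdir /scratch
hadoop dfs -chmod 777 /scratch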







Re: client connect as different username?

2008-06-11 Thread Chris Collins
Thanks Doug, should this be added to the permissions doc or to the  
faq?  See you in Sonoma.


C
On Jun 11, 2008, at 9:15 PM, Doug Cutting wrote:


Chris Collins wrote:
You are referring to creating a directory in HDFS?  Because if I am 
user chris and HDFS only has user foo, then I can't create a 
directory because I don't have perms; in fact I can't even connect.


Today, users and groups are declared by the client.  The namenode  
only records and checks against user and group names provided by the  
client.  So if someone named "foo" writes a file, then that file is  
owned by someone named "foo" and anyone named "foo" is the owner of  
that file. No "foo" account need exist on the namenode.


The one (important) exception is the "superuser".  Whatever user  
name starts the namenode is the superuser for that filesystem.  And  
if "/" is not world writable, a new filesystem will not contain a  
home directory (or anywhere else) writable by other users.  So, in a  
multiuser Hadoop installation, the superuser needs to create home  
directories and project directories for other users and set their  
protections accordingly before other users can do anything.  Perhaps  
this is what you've run into?


Doug




Re: Programatically initializing and starting HDFS cluster

2008-06-12 Thread Chris Collins
I am also interested in this option, since I will probably be 
hacking at such a thing in the next few weeks.


I am also curious whether you can run MR jobs in-process rather than 
launching a new JVM each time.  The scenario is when initialization takes far 
too long for a map-reduce shard to be executed in this model.  For 
example, say you are trying to compute the top n terms within a set of 
documents, where the top n are the rarest terms in some model corpus; 
perhaps you have a df index, or perhaps you have a huge NLP engine 
used for entity extraction.  Any of these assume a chunk of 
memory and a chunk of time to initialize on each pass.


Here, of course, you would really need not only to specify the job but 
also to somehow constrain the candidate nodes it can run on, based upon their 
ability to run it.
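On the initialization cost specifically, one partial workaround (short of true 
in-process execution) is to build the expensive state once per task JVM in a static 
field inside configure(), so it survives across tasks of the same job whenever the 
framework is allowed to reuse task JVMs.  A sketch against the old mapred API; the 
rare.terms.path key and the term-counting logic are only illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HeavyInitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // loaded once per JVM; with task-JVM reuse this survives across tasks
    private static Set<String> rareTerms;

    public void configure(JobConf job) {
        synchronized (HeavyInitMapper.class) {
            if (rareTerms != null) return;
            rareTerms = new HashSet<String>();
            try {
                Path p = new Path(job.get("rare.terms.path"));
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(p.getFileSystem(job).open(p)));
                String line;
                while ((line = in.readLine()) != null) rareTerms.add(line.trim());
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("could not load rare term list", e);
            }
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (rareTerms.contains(token)) out.collect(new Text(token), new IntWritable(1));
        }
    }
}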


C

On Jun 12, 2008, at 2:02 AM, Robert Krüger wrote:



Hi,

for our developers I would like to write a few lines of Java code 
that, given a base directory, sets up an HDFS filesystem, 
initializes it if it is not there yet, and then starts the 
service(s) in process. This is to run on each developer's machine, 
probably within a Tomcat instance. I don't want to do this (if I 
don't have to) in a bunch of shell scripts.

Could anyone point me to code samples that do similar things, or give 
any other hints that make this easier than looking at what the 
command-line tools do and reverse engineering it from there?


Thanks in advance,

Robert
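For what it's worth, a minimal in-process sketch along the lines Robert describes, 
assuming the Hadoop test jar (which carries MiniDFSCluster) is on the classpath; the 
class lived in org.apache.hadoop.dfs in the 0.1x releases and moved to 
org.apache.hadoop.hdfs later, so the import may need adjusting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.MiniDFSCluster;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmbeddedHdfs {
    public static void main(String[] args) throws Exception {
        // MiniDFSCluster keeps its name/data directories under this property
        System.setProperty("test.build.data", "/tmp/dev-hdfs");
        Configuration conf = new Configuration();
        // format = true on the first run; pass false to reuse existing data
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
        FileSystem fs = cluster.getFileSystem();
        fs.mkdirs(new Path("/user/dev"));
        System.out.println("embedded HDFS at " + fs.getUri());
        // call cluster.shutdown() when the hosting process goes away
    }
}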




Re: client connect as different username?

2008-06-12 Thread Chris Collins
Thanks Nicholas, I read it yet again (OK, only the third time).  Yes, it 
talks of whoami; I actually knew that from single-stepping the client 
too, but I was still stuck.  It mentions the POSIX model, which I had also 
kind of guessed from the javadocs.  Doug's note clearly states that 
"No 'foo' account need exist on the namenode" and that the only 
exception is the user that started the server.  I didn't get that 
clarity from the permissions doc.  Perhaps an example for the case where 
there are users other than the one that started the server... I would have 
thought this was a common one.  In our office we dumped this on a 
bunch of Linux boxes that all share the same username, but all our 
developers are using Macs with their own user names, and they don't 
expect to have their own user on the Linux boxes (because we are lazy 
that way).


For instance, all it requires for me to give, say, a Mac user with a login of 
bob access to things under /bob is to go in as the superuser and do something 
like:


hadoop dfs -mkdir /bob
hadoop dfs -chown bob /bob

where bob literally doesn't exist on the HDFS box and was not mentioned 
prior to those two commands.




On Jun 11, 2008, at 10:00 PM, [EMAIL PROTECTED] wrote:


This information can be found in 
http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
Nicholas








Re: Internet-Based Secure Clustered FS?

2008-06-18 Thread Chris Collins
Have you considered Amazon S3?  I don't know how strict your security 
requirements are.  There are lots of companies using it for just 
offsite data storage and also with EC2.
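For reference, Hadoop of this vintage can talk to S3 directly through its S3-backed 
FileSystem (s3:// block store, plus s3n:// native in 0.18+); a rough sketch, with the 
bucket name and credentials as placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3Check {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");   // placeholder
        conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder
        FileSystem fs = FileSystem.get(new URI("s3://my-offsite-bucket"), conf);
        System.out.println("bucket root exists: " + fs.exists(new Path("/")));
    }
}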



C

On Jun 17, 2008, at 6:48 PM, Kenneth Miller wrote:


All,

  I'm looking for a solution that would allow me to securely use  
VPSs (hosted VMs) or hosted dedicated servers as nodes in a  
distributed file system. My bandwidth/speed requirements aren't  
high, space requirements are potentially huge and ever growing,  
superb security is a must, but I really don't want to worry about  
hosting the DFS in-house. Is there any solution that's capable of  
this and/or is there anyone currently doing this?


Regards,
Kenneth Miller




Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Chris Collins

Sleepycat has a Java edition:

http://www.oracle.com/technology/products/berkeley-db/index.html

It has an "interesting" open source license.  If you don't need to ship 
it on an install disk, you're probably good to go with that too.


You could also consider Derby.
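A minimal sketch of the Java edition (JE), assuming je.jar is on the classpath; the 
key/value byte encoding (plain UTF-8 strings here) is up to the caller, and that is 
the part that has to stay language-neutral if non-Java modules read the same data:

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class JeSketch {
    public static void main(String[] args) throws Exception {
        File dir = new File("/tmp/je-env");
        dir.mkdirs();

        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "records", dbCfg);

        // one record: the key and value are just UTF-8 bytes
        db.put(null,
               new DatabaseEntry("record-0001".getBytes("UTF-8")),
               new DatabaseEntry("some payload".getBytes("UTF-8")));

        db.close();
        env.close();
    }
}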

C
On Nov 1, 2008, at 7:49 PM, lamfeeling wrote:

Consider an embedded database? Berkeley DB is written in C++ and has 
interfaces for many languages.






On 2008-11-02 10:15:22, "Zhou, Yunqing" <[EMAIL PROTECTED]> wrote:
The project I am focused on has many modules written in different languages 
(several modules are Hadoop jobs), so I'd like to use a common record-based 
data file format for data exchange.
XML is not efficient for appending new records.
SequenceFile seems not to have APIs for languages other than Java.
Protocol Buffers' Hadoop API seems to be under development.
Any recommendation for this?

Thanks




Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Chris Collins
Consider talking to Doug Cutting.  He is playing with the idea of a 
variant of JSON; I am sure he would love your help.  Specifically, he 
is looking at a coding scheme that is easy to read, does not duplicate 
key names per record, and supports file splits.
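Not Doug's actual design, just a toy illustration of the two properties described 
(field names written once rather than per record, and newline-separated records so 
a reader can split on record boundaries):

import java.io.FileWriter;
import java.io.PrintWriter;

public class HeaderOnceWriter {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter(new FileWriter("/tmp/records.dat"));
        out.println("#fields: url\tlang\tscore");       // key names appear only here
        out.println("http://example.com/a\ten\t0.93");  // record lines carry values only
        out.println("http://example.com/b\tzh\t0.88");
        out.close();
    }
}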


C
On Nov 1, 2008, at 8:20 PM, Zhou, Yunqing wrote:


An embedded database cannot handle large-scale data and is not very efficient; 
I have about 1 billion records.
These records should be passed through some modules.
What I mean is a data exchange format similar to XML but more flexible and 
efficient.







Re: Hadoop datanode crashed - SIGBUS

2008-12-01 Thread Chris Collins
Was there anything mentioned as part of the tombstone message about a 
"problematic frame"?  What Java are you using?  There are a few 
reasons for SIGBUS errors; one is illegal address alignment, but from 
Java that's very unlikely.  There were some issues with the native zip 
library in older VMs.  As Brian pointed out, sometimes this points to 
a hardware issue.


C
On Dec 1, 2008, at 1:32 PM, Sagar Naik wrote:




Brian Bockelman wrote:

Hardware/memory problems?

I'm not sure.


SIGBUS is relatively rare; it sometimes indicates a hardware error  
in the memory system, depending on your arch.



*uname -a : *
Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST  
2006 i686 i686 i386 GNU/Linux

*top's top*
Cpu(s):  0.1% us,  1.1% sy,  0.0% ni, 98.0% id,  0.8% wa,  0.0% hi,   
0.0% si
Mem:   8288280k total,  1575680k used,  6712600k free, 5392k  
buffers
Swap: 16386292k total,   68k used, 16386224k free,   522408k  
cached


8-core Xeon, 2 GHz


Brian

On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote:


A couple of the datanodes crashed with the following error.
/tmp is 15% occupied.

#
# An unexpected error has been detected by Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408
#
[Too many errors, abort]

Please suggest how I should go about debugging this particular problem.


-Sagar




Thanks to Brian

-Sagar




Re: Hadoop datanode crashed - SIGBUS

2008-12-01 Thread Chris Collins
I had some pretty bad issues with leaks in _07.  _10, by the way, has a lot of 
bug fixes; I don't know whether it would fix this problem.  As for flags, I 
wouldn't know.  One thing you could try is to match the program counter 
against a memory region.  If you use jstack or jmap (I can't remember which), 
it will give you a dump of all the libraries and their memory address ranges.  
From that you may see whether the program counter matches anything interesting.
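Roughly (on Java 6, jmap with no options prints the loaded shared objects and their 
start addresses; pmap gives the OS-level view of the same mappings):

jmap 10111
pmap 10111
# then look for the mapping whose address range contains the pc value
# from the crash header, e.g. pc=0xb4edcb6a in the report above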


Other than that I would go with Brian's recommendations.

C
On Dec 1, 2008, at 1:59 PM, Sagar Naik wrote:



Hi,
I don't have additional information on it. If you know any other flags that I 
need to turn on, please do tell me. The flags that are currently on are 
"-XX:+HeapDumpOnOutOfMemoryError -XX:+UseParallelGC -Dcom.sun.management.jmxremote".

But this is what is listed in the stdout (datanode.out) file:

Java version :
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode)


I will try to stress test the memory.

-Sagar









Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Chris Collins
A couple of years back we did a lot of experimentation comparing Sun's 
VM and JRockit.  We had initially assumed that JRockit was going to 
scream, since that's what the press were saying.  In short, what we 
discovered was that certain JDK library usage was a little bit faster 
with JRockit, but for core VM performance such as synchronization and 
primitive operations, the Sun VM outperformed it.  We were not taking 
account of startup time, just raw code execution.  As I said, this was 
a couple of years back, so things may have changed.


C
On May 7, 2009, at 2:17 AM, Grace wrote:

I am running the test on 0.18.1 and 0.19.1. Both versions have the same 
issue with the JRockit JVM.  It is the example sort job, sorting 20G of data 
on 1+2 nodes.

Following is the result (version 0.18.1). The sort job running with the JRockit 
JVM took about 260 seconds more than with the Sun JVM.
----------------------------------
|| JVM     || Completion Time   ||
----------------------------------
|| JRockit || 786,315 msec      ||
|| Sun     || 526,602 msec      ||
----------------------------------

Furthermore, under version 0.19.1, I have set the JVM-reuse parameter to -1. 
It seems to make no improvement for the JRockit JVM.
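For reference, the 0.19 setting being referred to is mapred.job.reuse.jvm.num.tasks 
(-1 meaning reuse the task JVM for an unlimited number of tasks of the same job); if 
I have the 0.19 API right, JobConf also exposes it programmatically:

import org.apache.hadoop.mapred.JobConf;

public class ReuseJvm {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumTasksToExecutePerJvm(-1);  // same as mapred.job.reuse.jvm.num.tasks = -1
        System.out.println(conf.get("mapred.job.reuse.jvm.num.tasks"));
    }
}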

On Thu, May 7, 2009 at 4:32 PM, JQ Hadoop  wrote:

I believe the JRockit JVM has a slightly higher startup time than the Sun JVM, 
but that should not make a lot of difference, especially if JVMs are reused in 
0.19.

Which Hadoop version are you using?  What Hadoop job are you running?  And 
what performance do you get?

Thanks,
JQ

-Original Message-
From: Grace
Sent: Wednesday, May 06, 2009 1:07 PM
To: core-user@hadoop.apache.org
Subject: Is there any performance issue with Jrockit JVM for Hadoop

Hi all,
This is Grace.
I am replacing the Sun JVM with the JRockit JVM for Hadoop, keeping all the 
same Java options and configuration as with the Sun JVM.  However, it is very 
strange that the performance using the JRockit JVM is poorer than with Sun; 
for example, the map stage became slower.
Has anyone encountered a similar problem? Could you please give some advice 
about it? Thanks a lot.

Regards,
Grace





Re: Huge DataNode Virtual Memory Usage

2009-05-08 Thread Chris Collins
Stefan, there was a nasty memory leak in 1.6.x before 1.6.0_10.  It 
manifested itself during major GC.  We saw this on Linux and Solaris, 
and it dramatically improved with an upgrade.


C
On May 8, 2009, at 6:12 PM, Stefan Will wrote:


Hi,

I just ran into something rather scary: one of my datanode processes that 
I'm running with -Xmx256M, and a maximum number of Xceiver threads of 4095, 
had a virtual memory size of over 7GB (!). I know that the VM size on Linux 
isn't necessarily equal to the actual memory used, but I wouldn't expect it 
to be an order of magnitude higher either. I ran pmap on the process, and it 
showed around 1000 thread stack blocks of roughly 1MB each (which is the 
default size on the 64-bit JDK). The largest block was 3GB in size, and I 
can't figure out what it is for.

Does anyone have any insights into this? Anything that can be done to 
prevent this other than restarting the DFS regularly?

-- Stefan
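Two knobs that bear on the roughly 1MB-per-thread stacks described above, offered 
as things to experiment with rather than a confirmed fix for the mysterious 3GB 
mapping:

# in conf/hadoop-env.sh: shrink the per-thread stack of the datanode JVM
export HADOOP_DATANODE_OPTS="-Xss256k $HADOOP_DATANODE_OPTS"

# in hadoop-site.xml: cap the xceiver thread count with
# dfs.datanode.max.xcievers (note the historical spelling), e.g. 1024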




Re: Huge DataNode Virtual Memory Usage

2009-05-09 Thread Chris Collins

I think it may have been 6676016:

http://java.sun.com/javase/6/webnotes/6u10.html

We were able to repro this at the time through heavy Lucene indexing plus 
our internal document pre-processing logic that churned a lot of objects.  
We have still experienced similar issues with _10, but much rarer.  Maybe 
going to _13 may shed some light; you could be tickling another similar bug, 
but I didn't see anything obvious.


C


On May 9, 2009, at 12:30 AM, Stefan Will wrote:


Chris,

Thanks for the tip ... However I'm already running 1.6_10:

java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Do you know of a specific bug # in the JDK bug database that addresses this?

Cheers,
Stefan


