Re: Try to mount HDFS

2010-04-22 Thread Christian Baun
Hi,

Even with the port information added inside core-site.xml, the problem remains:


<property>
  <name>fs.default.name</name>
  <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020</value>
  <final>true</final>
</property>

# ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 
/mnt/hdfs/ 
port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
fuse-dfs didn't recognize /mnt/hdfs/,-2

# ls /mnt/hdfs
ls: cannot access /mnt/hdfs/®1: No such file or directory

Best Regards,
   Christian


Am Freitag, 23. April 2010 schrieb Christian Baun:
> Hi Brian,
> 
> this is inside my core-site.xml 
> 
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com/</value>
>     <final>true</final>
>   </property>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/mnt</value>
>     <description>A base for other temporary directories.</description>
>   </property>
> </configuration>
> 
> Do I need to give the port here? 
> 
> this is inside my hdfs-site.xml
> 
> <configuration>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/name</value>
>     <final>true</final>
>   </property>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/data</value>
>   </property>
>   <property>
>     <name>fs.checkpoint.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
>     <final>true</final>
>   </property>
> </configuration>
> 
> These directories do all exist
> 
> # ls -l /mnt/dfs/
> total 12
> drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 data
> drwxr-xr-x 4 hadoop hadoop 4096 2010-04-23 05:17 name
> drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 namesecondary
> 
> I don't have the config file hadoop-site.xml in /etc/...
> In the source directory of hadoop I have a hadoop-site.xml but with this 
> information
> 
> 
> 
> 
> 
> 
> 
> Best Regards,
>Christian 
> 
> 
> 
> Am Freitag, 23. April 2010 schrieb Brian Bockelman:
> > Hey Christian,
> > 
> > I've run into this before.
> > 
> > Make sure that the hostname/port you give to fuse is EXACTLY the same as 
> > listed in hadoop-site.xml.
> > 
> > If these aren't the same text string (including the ":8020"), then you get 
> > those sort of issues.
> > 
> > Brian
> > 
> > On Apr 22, 2010, at 5:00 AM, Christian Baun wrote:
> > 
> > > Dear All,
> > > 
> > > I want to test HDFS inside Amazon EC2.
> > > 
> > > Two Ubuntu instances are running inside EC2. 
> > > One server is namenode and jobtracker. The other server is the datanode.
> > > Cloudera (hadoop-0.20) is installed and running.
> > > 
> > > Now, I want to mount HDFS.
> > > I tried to install contrib/fuse-dfs as described here:
> > > http://wiki.apache.org/hadoop/MountableHDFS
> > > 
> > > The compilation worked via:
> > > 
> > > # ant compile-c++-libhdfs -Dlibhdfs=1
> > > # ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/ 
> > > -Dforrest.home=/home/ubuntu/apache-forrest-0.8/
> > > # ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> > > 
> > > But now, when I try to mount the filesystem:
> > > 
> > > # ./fuse_dfs_wrapper.sh 
> > > dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/ -d
> > > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > > fuse-dfs ignoring option -d
> > > FUSE library version: 2.8.1
> > > nullpath_ok: 0
> > > unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> > > INIT: 7.13
> > > flags=0x007b
> > > max_readahead=0x0002
> > >   INIT: 7.12
> > >   flags=0x0011
> > >   max_readahead=0x0002
> > >   max_write=0x0002
> > >   unique: 1, success, outsize: 40
> > > 
> > > 
> > > # ./fuse_dfs_wrapper.sh 
> > > dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
> > > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > > 
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > # ls /mnt/hdfs/
> > > ls: cannot access /mnt/hdfs/o¢: No such file or directory
> > > o???
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > # ls /mnt/hdfs/
> > > ls: cannot access /mnt/hdfs/`á›Óÿ: No such file or directory
> > > `?
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > ...
> > > 
> > > 
> > > What can I do at this point?
> > > 
> > > Thanks in advance
> > > Christian
> > 
> > 
> 
> 




Re: Try to mount HDFS

2010-04-22 Thread Christian Baun
Hi paul,

when I use port 9000 instead of 8020, the problem still exists.

# ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000 
/mnt/hdfs/ 
port=9000,server=ec2-75-101-210-65.compute-1.amazonaws.com
fuse-dfs didn't recognize /mnt/hdfs/,-2

# ls -l /mnt/hdfs
ls: cannot access /mnt/hdfs: Input/output error

Best Regards,
   Christian 


Am Freitag, 23. April 2010 schrieb paul:
> Just a heads up on this, we've run into problems when trying to use fuse to
> mount dfs running on port :8020.  However, it works fine when we ran it on
> :9000.
> 
> 
> -paul
> 
> 
> On Thu, Apr 22, 2010 at 7:59 PM, Brian Bockelman wrote:
> 
> > Hey Christian,
> >
> > I've run into this before.
> >
> > Make sure that the hostname/port you give to fuse is EXACTLY the same as
> > listed in hadoop-site.xml.
> >
> > If these aren't the same text string (including the ":8020"), then you get
> > those sort of issues.
> >
> > Brian
> >
> > On Apr 22, 2010, at 5:00 AM, Christian Baun wrote:
> >
> > > Dear All,
> > >
> > > I want to test HDFS inside Amazon EC2.
> > >
> > > Two Ubuntu instances are running inside EC2.
> > > One server is namenode and jobtracker. The other server is the datanode.
> > > Cloudera (hadoop-0.20) is installed and running.
> > >
> > > Now, I want to mount HDFS.
> > > I tried to install contrib/fuse-dfs as described here:
> > > http://wiki.apache.org/hadoop/MountableHDFS
> > >
> > > The compilation worked via:
> > >
> > > # ant compile-c++-libhdfs -Dlibhdfs=1
> > > # ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/
> > -Dforrest.home=/home/ubuntu/apache-forrest-0.8/
> > > # ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> > >
> > > But now, when I try to mount the filesystem:
> > >
> > > # ./fuse_dfs_wrapper.sh dfs://
> > ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/ -d
> > > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > > fuse-dfs ignoring option -d
> > > FUSE library version: 2.8.1
> > > nullpath_ok: 0
> > > unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> > > INIT: 7.13
> > > flags=0x007b
> > > max_readahead=0x0002
> > >   INIT: 7.12
> > >   flags=0x0011
> > >   max_readahead=0x0002
> > >   max_write=0x0002
> > >   unique: 1, success, outsize: 40
> > >
> > >
> > > # ./fuse_dfs_wrapper.sh dfs://
> > ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
> > > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > >
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > # ls /mnt/hdfs/
> > > ls: cannot access /mnt/hdfs/o¢  : No such file or directory
> > > o???
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > # ls /mnt/hdfs/
> > > ls: cannot access /mnt/hdfs/`á›Óÿ : No such file or directory
> > > `?
> > > # ls /mnt/hdfs/
> > > ls: reading directory /mnt/hdfs/: Input/output error
> > > ...
> > >
> > >
> > > What can I do at this point?
> > >
> > > Thanks in advance
> > > Christian
> >
> >
> 




Re: Try to mount HDFS

2010-04-22 Thread Christian Baun
Hi Brian,

this is inside my core-site.xml 



<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com/</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>

Do I need to give the port here? 

this is inside my hdfs-site.xml



<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
    <final>true</final>
  </property>
</configuration>


These directories all exist:

# ls -l /mnt/dfs/
total 12
drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 data
drwxr-xr-x 4 hadoop hadoop 4096 2010-04-23 05:17 name
drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 namesecondary

I don't have the config file hadoop-site.xml in /etc/...
In the source directory of hadoop I have a hadoop-site.xml but with this 
information







Best Regards,
   Christian 



Am Freitag, 23. April 2010 schrieb Brian Bockelman:
> Hey Christian,
> 
> I've run into this before.
> 
> Make sure that the hostname/port you give to fuse is EXACTLY the same as 
> listed in hadoop-site.xml.
> 
> If these aren't the same text string (including the ":8020"), then you get 
> those sort of issues.
> 
> Brian
> 
> On Apr 22, 2010, at 5:00 AM, Christian Baun wrote:
> 
> > Dear All,
> > 
> > I want to test HDFS inside Amazon EC2.
> > 
> > Two Ubuntu instances are running inside EC2. 
> > One server is namenode and jobtracker. The other server is the datanode.
> > Cloudera (hadoop-0.20) is installed and running.
> > 
> > Now, I want to mount HDFS.
> > I tried to install contrib/fuse-dfs as described here:
> > http://wiki.apache.org/hadoop/MountableHDFS
> > 
> > The compilation worked via:
> > 
> > # ant compile-c++-libhdfs -Dlibhdfs=1
> > # ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/ 
> > -Dforrest.home=/home/ubuntu/apache-forrest-0.8/
> > # ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> > 
> > But now, when I try to mount the filesystem:
> > 
> > # ./fuse_dfs_wrapper.sh 
> > dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/ -d
> > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > fuse-dfs ignoring option -d
> > FUSE library version: 2.8.1
> > nullpath_ok: 0
> > unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> > INIT: 7.13
> > flags=0x007b
> > max_readahead=0x0002
> >   INIT: 7.12
> >   flags=0x0011
> >   max_readahead=0x0002
> >   max_write=0x0002
> >   unique: 1, success, outsize: 40
> > 
> > 
> > # ./fuse_dfs_wrapper.sh 
> > dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
> > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > 
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > # ls /mnt/hdfs/
> > ls: cannot access /mnt/hdfs/o¢: No such file or directory
> > o???
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > # ls /mnt/hdfs/
> > ls: cannot access /mnt/hdfs/`á›Óÿ: No such file or directory
> > `?
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > ...
> > 
> > 
> > What can I do at this point?
> > 
> > Thanks in advance
> > Christian
> 
> 




Re: separate JVM flags for map and reduce tasks

2010-04-22 Thread Hemanth Yamijala
Vasilis,

> I 'd like to pass different JVM options for map tasks and different
> ones for reduce tasks. I think it should be straightforward to add
> mapred.mapchild.java.opts, mapred.reducechild.java.opts to my
> conf/mapred-site.xml and process the new options accordingly in
> src/mapred/org/apache/mapreduce/TaskRunner.java . Let me know if you
> think it's more involved than what I described.

In trunk (I haven't checked earlier versions), there are already
options such as mapreduce.map.java.opts and
mapreduce.reduce.java.opts. Strangely, these are not documented in
mapred-default.xml, though the option mapred.child.java.opts is
deprecated in favor of the other two options. Please refer to
MAPREDUCE-478 for details.
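For anyone who wants to try this on a build that has those properties, a
minimal sketch of setting them per job (a sketch only; the heap sizes are
arbitrary examples, and older releases may not honour these names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SeparateTaskJvmOpts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Map tasks get a smaller heap, reduce tasks a larger one (example values only).
    conf.set("mapreduce.map.java.opts", "-Xmx512m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx1024m");

    Job job = new Job(conf, "separate-jvm-opts-example");
    // ... configure mapper, reducer, input and output paths as usual ...
  }
}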

>
> My question is: if mapred.job.reuse.jvm.num.tasks is set to -1 (always
> reuse), can the same JVM be re-used for different types of tasks? So
> the same JVM being used e.g. first by a map task and then used by
> reduce task. I am assuming this is definitely possible, though I
> haven't verified in the code.

Nope. JVMs are not reused across types. o.a.h.mapred.JvmManager has
the relevant information. There's a JvmManagerForType inner class to
which all reuse related calls are delegated and that is per type. In
particular, launchJVM which is the basic method that triggers a reuse
or spawns a new JVM, operates based on the task type.

> So , if one wants to pass different jvm options to map tasks and
> reduce tasks, perhaps jobs.reuse.jvm.num.task should be set to 1
> (never reuse) ?
>

Given the above, this is not necessary. You can reuse JVMs and pass
separate parameters to the respective task types.


RE: Using external library in MapReduce jobs

2010-04-22 Thread Michael Segel



> Date: Thu, 22 Apr 2010 17:30:13 -0700
> Subject: Re: Using external library in MapReduce jobs
> From: ale...@cloudera.com
> To: common-user@hadoop.apache.org
> 
> Sure, you need to place them into $HADOOP_HOME/lib directory on each server
> in the cluster and they will be picked up on the next restart.
> 
> -- Alex K
> 

While this works, I wouldn't recommend it.

You have to look at it this way... your external M/R Java libs are job-centric. 
So every time you want to add jobs that require new external libraries, you have 
to 'bounce' your cloud after pushing the jars. Then you also have the issue of 
Java class collisions if the cloud already has a version of the same jar you're 
using. (We've had this happen to us already.)

If you're just testing a proof of concept, it's one thing, but after the 
proof you'll need to determine how to correctly push the jars out to each node.

In a production environment, constantly bouncing clouds for each new job isn't 
really a good idea.

HTH

-Mike
  

Re: Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
I tried the addFileToClassPath method but it did not work for me, I don't
know why.

On Thu, Apr 22, 2010 at 7:30 PM, Utkarsh Agarwal wrote:

> maybe this will help "DistributedCache"
>
>
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)
>
> -Utkarsh.
>
> On Thu, Apr 22, 2010 at 5:18 PM, Farhan Husain  >wrote:
>
> > Hello Alex,
> >
> > Is there any way to distribute the java library jar files to all nodes
> like
> > the way for the native libraries?
> >
> > Thanks,
> > Farhan
> >
> > On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov 
> wrote:
> >
> > > Hi Farhan,
> > >
> > > Are you talking about java libs (jar) or native libs (.so, etc)?
> > >
> > > *Jars:*
> > >
> > > You can just jar it with your jar file, just put it in a lib
> subdirectory
> > > of
> > > your jar root directory
> > >
> > > *Native:
> > >
> > > *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> > > cluster
> > >
> > > where PLATFORM is the string returned by `hadoop
> > > org.apache.hadoop.util.PlatformName`
> > >
> > > There is a way to distribute native libs runtime, but it's more
> involved.
> > >
> > > Alex K
> > >
> > > On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> > > m.vijayaragh...@gmail.com> wrote:
> > >
> > > > Hello Farhan,
> > > >
> > > >I use an external library and I run the MR job from command
> > line.
> > > So
> > > > I specify it in -libjars as follows
> > > >
> > > > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> > > class)
> > > >
> > > > Raghava.
> > > >
> > > > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> > > farhan.hus...@csebuet.org
> > > > >wrote:
> > > >
> > > > > Hello guys,
> > > > >
> > > > > Can you please tell me how I can use external libraries which my
> jobs
> > > > link
> > > > > to in a MapReduce job? I added the following lines in
> mapred-site.xml
> > > in
> > > > > all
> > > > > my nodes and put the external library jars in the specified
> directory
> > > but
> > > > I
> > > > > am getting ClassNotFoundException:
> > > > >
> > > > > 
> > > > >  mapred.child.java.opts
> > > > >  -Xmx512m
> -Djava.library.path=/hadoop/Hadoop/userlibs
> > > > > 
> > > > >
> > > > > Am I doing anything wrong? Is there any other way to solve my
> > problem?
> > > > >
> > > > > Thanks,
> > > > > Farhan
> > > > >
> > > >
> > >
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
Thanks.

On Thu, Apr 22, 2010 at 7:30 PM, Alex Kozlov  wrote:

> Sure, you need to place them into $HADOOP_HOME/lib directory on each server
> in the cluster and they will be picked up on the next restart.
>
> -- Alex K
>
> On Thu, Apr 22, 2010 at 5:18 PM, Farhan Husain  >wrote:
>
> > Hello Alex,
> >
> > Is there any way to distribute the java library jar files to all nodes
> like
> > the way for the native libraries?
> >
> > Thanks,
> > Farhan
> >
> > On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov 
> wrote:
> >
> > > Hi Farhan,
> > >
> > > Are you talking about java libs (jar) or native libs (.so, etc)?
> > >
> > > *Jars:*
> > >
> > > You can just jar it with your jar file, just put it in a lib
> subdirectory
> > > of
> > > your jar root directory
> > >
> > > *Native:
> > >
> > > *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> > > cluster
> > >
> > > where PLATFORM is the string returned by `hadoop
> > > org.apache.hadoop.util.PlatformName`
> > >
> > > There is a way to distribute native libs runtime, but it's more
> involved.
> > >
> > > Alex K
> > >
> > > On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> > > m.vijayaragh...@gmail.com> wrote:
> > >
> > > > Hello Farhan,
> > > >
> > > >I use an external library and I run the MR job from command
> > line.
> > > So
> > > > I specify it in -libjars as follows
> > > >
> > > > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> > > class)
> > > >
> > > > Raghava.
> > > >
> > > > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> > > farhan.hus...@csebuet.org
> > > > >wrote:
> > > >
> > > > > Hello guys,
> > > > >
> > > > > Can you please tell me how I can use external libraries which my
> jobs
> > > > link
> > > > > to in a MapReduce job? I added the following lines in
> mapred-site.xml
> > > in
> > > > > all
> > > > > my nodes and put the external library jars in the specified
> directory
> > > but
> > > > I
> > > > > am getting ClassNotFoundException:
> > > > >
> > > > > 
> > > > >  mapred.child.java.opts
> > > > >  -Xmx512m
> -Djava.library.path=/hadoop/Hadoop/userlibs
> > > > > 
> > > > >
> > > > > Am I doing anything wrong? Is there any other way to solve my
> > problem?
> > > > >
> > > > > Thanks,
> > > > > Farhan
> > > > >
> > > >
> > >
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Utkarsh Agarwal
maybe this will help "DistributedCache"


http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)

-Utkarsh.
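A minimal sketch of that DistributedCache route, in case it helps; the HDFS
path /libs/my-external-lib.jar is just a placeholder, and the jar must be
copied to HDFS first (e.g. hadoop fs -put):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithExtraJar {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Add the already-uploaded jar (placeholder path) to every task's classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/my-external-lib.jar"), conf);

    Job job = new Job(conf, "job-using-external-lib");
    // ... set mapper, reducer, input and output as usual, then submit ...
  }
}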

On Thu, Apr 22, 2010 at 5:18 PM, Farhan Husain wrote:

> Hello Alex,
>
> Is there any way to distribute the java library jar files to all nodes like
> the way for the native libraries?
>
> Thanks,
> Farhan
>
> On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov  wrote:
>
> > Hi Farhan,
> >
> > Are you talking about java libs (jar) or native libs (.so, etc)?
> >
> > *Jars:*
> >
> > You can just jar it with your jar file, just put it in a lib subdirectory
> > of
> > your jar root directory
> >
> > *Native:
> >
> > *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> > cluster
> >
> > where PLATFORM is the string returned by `hadoop
> > org.apache.hadoop.util.PlatformName`
> >
> > There is a way to distribute native libs runtime, but it's more involved.
> >
> > Alex K
> >
> > On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> > m.vijayaragh...@gmail.com> wrote:
> >
> > > Hello Farhan,
> > >
> > >I use an external library and I run the MR job from command
> line.
> > So
> > > I specify it in -libjars as follows
> > >
> > > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> > class)
> > >
> > > Raghava.
> > >
> > > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> > farhan.hus...@csebuet.org
> > > >wrote:
> > >
> > > > Hello guys,
> > > >
> > > > Can you please tell me how I can use external libraries which my jobs
> > > link
> > > > to in a MapReduce job? I added the following lines in mapred-site.xml
> > in
> > > > all
> > > > my nodes and put the external library jars in the specified directory
> > but
> > > I
> > > > am getting ClassNotFoundException:
> > > >
> > > > 
> > > >  mapred.child.java.opts
> > > >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > > > 
> > > >
> > > > Am I doing anything wrong? Is there any other way to solve my
> problem?
> > > >
> > > > Thanks,
> > > > Farhan
> > > >
> > >
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Alex Kozlov
Sure, you need to place them into $HADOOP_HOME/lib directory on each server
in the cluster and they will be picked up on the next restart.

-- Alex K

On Thu, Apr 22, 2010 at 5:18 PM, Farhan Husain wrote:

> Hello Alex,
>
> Is there any way to distribute the java library jar files to all nodes like
> the way for the native libraries?
>
> Thanks,
> Farhan
>
> On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov  wrote:
>
> > Hi Farhan,
> >
> > Are you talking about java libs (jar) or native libs (.so, etc)?
> >
> > *Jars:*
> >
> > You can just jar it with your jar file, just put it in a lib subdirectory
> > of
> > your jar root directory
> >
> > *Native:
> >
> > *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> > cluster
> >
> > where PLATFORM is the string returned by `hadoop
> > org.apache.hadoop.util.PlatformName`
> >
> > There is a way to distribute native libs runtime, but it's more involved.
> >
> > Alex K
> >
> > On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> > m.vijayaragh...@gmail.com> wrote:
> >
> > > Hello Farhan,
> > >
> > >I use an external library and I run the MR job from command
> line.
> > So
> > > I specify it in -libjars as follows
> > >
> > > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> > class)
> > >
> > > Raghava.
> > >
> > > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> > farhan.hus...@csebuet.org
> > > >wrote:
> > >
> > > > Hello guys,
> > > >
> > > > Can you please tell me how I can use external libraries which my jobs
> > > link
> > > > to in a MapReduce job? I added the following lines in mapred-site.xml
> > in
> > > > all
> > > > my nodes and put the external library jars in the specified directory
> > but
> > > I
> > > > am getting ClassNotFoundException:
> > > >
> > > > 
> > > >  mapred.child.java.opts
> > > >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > > > 
> > > >
> > > > Am I doing anything wrong? Is there any other way to solve my
> problem?
> > > >
> > > > Thanks,
> > > > Farhan
> > > >
> > >
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
Hello Alex,

Is there any way to distribute the java library jar files to all nodes like
the way for the native libraries?

Thanks,
Farhan

On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov  wrote:

> Hi Farhan,
>
> Are you talking about java libs (jar) or native libs (.so, etc)?
>
> *Jars:*
>
> You can just jar it with your jar file, just put it in a lib subdirectory
> of
> your jar root directory
>
> *Native:
>
> *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> cluster
>
> where PLATFORM is the string returned by `hadoop
> org.apache.hadoop.util.PlatformName`
>
> There is a way to distribute native libs runtime, but it's more involved.
>
> Alex K
>
> On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
> > Hello Farhan,
> >
> >I use an external library and I run the MR job from command line.
> So
> > I specify it in -libjars as follows
> >
> > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> class)
> >
> > Raghava.
> >
> > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> farhan.hus...@csebuet.org
> > >wrote:
> >
> > > Hello guys,
> > >
> > > Can you please tell me how I can use external libraries which my jobs
> > link
> > > to in a MapReduce job? I added the following lines in mapred-site.xml
> in
> > > all
> > > my nodes and put the external library jars in the specified directory
> but
> > I
> > > am getting ClassNotFoundException:
> > >
> > > 
> > >  mapred.child.java.opts
> > >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > > 
> > >
> > > Am I doing anything wrong? Is there any other way to solve my problem?
> > >
> > > Thanks,
> > > Farhan
> > >
> >
>


Re: Try to mount HDFS

2010-04-22 Thread paul
Just a heads up on this: we've run into problems when trying to use fuse to
mount dfs running on port :8020.  However, it worked fine when we ran it on
:9000.


-paul


On Thu, Apr 22, 2010 at 7:59 PM, Brian Bockelman wrote:

> Hey Christian,
>
> I've run into this before.
>
> Make sure that the hostname/port you give to fuse is EXACTLY the same as
> listed in hadoop-site.xml.
>
> If these aren't the same text string (including the ":8020"), then you get
> those sort of issues.
>
> Brian
>
> On Apr 22, 2010, at 5:00 AM, Christian Baun wrote:
>
> > Dear All,
> >
> > I want to test HDFS inside Amazon EC2.
> >
> > Two Ubuntu instances are running inside EC2.
> > One server is namenode and jobtracker. The other server is the datanode.
> > Cloudera (hadoop-0.20) is installed and running.
> >
> > Now, I want to mount HDFS.
> > I tried to install contrib/fuse-dfs as described here:
> > http://wiki.apache.org/hadoop/MountableHDFS
> >
> > The compilation worked via:
> >
> > # ant compile-c++-libhdfs -Dlibhdfs=1
> > # ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/
> -Dforrest.home=/home/ubuntu/apache-forrest-0.8/
> > # ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> >
> > But now, when I try to mount the filesystem:
> >
> > # ./fuse_dfs_wrapper.sh dfs://
> ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/ -d
> > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > fuse-dfs didn't recognize /mnt/hdfs/,-2
> > fuse-dfs ignoring option -d
> > FUSE library version: 2.8.1
> > nullpath_ok: 0
> > unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> > INIT: 7.13
> > flags=0x007b
> > max_readahead=0x0002
> >   INIT: 7.12
> >   flags=0x0011
> >   max_readahead=0x0002
> >   max_write=0x0002
> >   unique: 1, success, outsize: 40
> >
> >
> > # ./fuse_dfs_wrapper.sh dfs://
> ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
> > port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> > fuse-dfs didn't recognize /mnt/hdfs/,-2
> >
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > # ls /mnt/hdfs/
> > ls: cannot access /mnt/hdfs/o¢  : No such file or directory
> > o???
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > # ls /mnt/hdfs/
> > ls: cannot access /mnt/hdfs/`á›Óÿ : No such file or directory
> > `?
> > # ls /mnt/hdfs/
> > ls: reading directory /mnt/hdfs/: Input/output error
> > ...
> >
> >
> > What can I do at this point?
> >
> > Thanks in advance
> > Christian
>
>


Re: Try to mount HDFS

2010-04-22 Thread Brian Bockelman
Hey Christian,

I've run into this before.

Make sure that the hostname/port you give to fuse is EXACTLY the same as listed 
in hadoop-site.xml.

If these aren't the same text string (including the ":8020"), then you get 
those sort of issues.

Brian
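One cheap way to double-check that the URI really answers on that exact host
and port, before handing it to fuse-dfs, is a tiny client-side listing (a
hedged sketch; the hostname is the one from this thread):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckNameNodeUri {
  public static void main(String[] args) throws Exception {
    // Use exactly the URI you intend to give to fuse-dfs (same host string, same port).
    URI uri = URI.create("hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020/");
    FileSystem fs = FileSystem.get(uri, new Configuration());

    // If this listing works, the namenode is reachable on that host:port and
    // any remaining trouble is on the fuse-dfs side.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}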

On Apr 22, 2010, at 5:00 AM, Christian Baun wrote:

> Dear All,
> 
> I want to test HDFS inside Amazon EC2.
> 
> Two Ubuntu instances are running inside EC2. 
> One server is namenode and jobtracker. The other server is the datanode.
> Cloudera (hadoop-0.20) is installed and running.
> 
> Now, I want to mount HDFS.
> I tried to install contrib/fuse-dfs as described here:
> http://wiki.apache.org/hadoop/MountableHDFS
> 
> The compilation worked via:
> 
> # ant compile-c++-libhdfs -Dlibhdfs=1
> # ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/ 
> -Dforrest.home=/home/ubuntu/apache-forrest-0.8/
> # ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1
> 
> But now, when I try to mount the filesystem:
> 
> # ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 
> /mnt/hdfs/ -d
> port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> fuse-dfs didn't recognize /mnt/hdfs/,-2
> fuse-dfs ignoring option -d
> FUSE library version: 2.8.1
> nullpath_ok: 0
> unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> INIT: 7.13
> flags=0x007b
> max_readahead=0x0002
>   INIT: 7.12
>   flags=0x0011
>   max_readahead=0x0002
>   max_write=0x0002
>   unique: 1, success, outsize: 40
> 
> 
> # ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 
> /mnt/hdfs/
> port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
> fuse-dfs didn't recognize /mnt/hdfs/,-2
> 
> # ls /mnt/hdfs/
> ls: reading directory /mnt/hdfs/: Input/output error
> # ls /mnt/hdfs/
> ls: cannot access /mnt/hdfs/o¢: No such file or directory
> o???
> # ls /mnt/hdfs/
> ls: reading directory /mnt/hdfs/: Input/output error
> # ls /mnt/hdfs/
> ls: cannot access /mnt/hdfs/`á›Óÿ: No such file or directory
> `?
> # ls /mnt/hdfs/
> ls: reading directory /mnt/hdfs/: Input/output error
> ...
> 
> 
> What can I do at this point?
> 
> Thanks in advance
> Christian





Re: Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
Thanks Raghava.

On Thu, Apr 22, 2010 at 6:04 PM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello Farhan,
>
>I use an external library and I run the MR job from command line. So
> I specify it in -libjars as follows
>
> hadoop jar (my jar) (my class) -libjars (external jar) (args for my class)
>
> Raghava.
>
> On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain  >wrote:
>
> > Hello guys,
> >
> > Can you please tell me how I can use external libraries which my jobs
> link
> > to in a MapReduce job? I added the following lines in mapred-site.xml in
> > all
> > my nodes and put the external library jars in the specified directory but
> I
> > am getting ClassNotFoundException:
> >
> > 
> >  mapred.child.java.opts
> >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > 
> >
> > Am I doing anything wrong? Is there any other way to solve my problem?
> >
> > Thanks,
> > Farhan
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
Thanks Alex, it is about Java libs. I will try both Raghava's approach and
yours. I want to be able to run the job from Eclipse, and it seems that yours
is better suited for that.

On Thu, Apr 22, 2010 at 6:28 PM, Alex Kozlov  wrote:

> Hi Farhan,
>
> Are you talking about java libs (jar) or native libs (.so, etc)?
>
> *Jars:*
>
> You can just jar it with your jar file, just put it in a lib subdirectory
> of
> your jar root directory
>
> *Native:
>
> *Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
> cluster
>
> where PLATFORM is the string returned by `hadoop
> org.apache.hadoop.util.PlatformName`
>
> There is a way to distribute native libs runtime, but it's more involved.
>
> Alex K
>
> On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
> > Hello Farhan,
> >
> >I use an external library and I run the MR job from command line.
> So
> > I specify it in -libjars as follows
> >
> > hadoop jar (my jar) (my class) -libjars (external jar) (args for my
> class)
> >
> > Raghava.
> >
> > On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain <
> farhan.hus...@csebuet.org
> > >wrote:
> >
> > > Hello guys,
> > >
> > > Can you please tell me how I can use external libraries which my jobs
> > link
> > > to in a MapReduce job? I added the following lines in mapred-site.xml
> in
> > > all
> > > my nodes and put the external library jars in the specified directory
> but
> > I
> > > am getting ClassNotFoundException:
> > >
> > > 
> > >  mapred.child.java.opts
> > >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > > 
> > >
> > > Am I doing anything wrong? Is there any other way to solve my problem?
> > >
> > > Thanks,
> > > Farhan
> > >
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Alex Kozlov
Hi Farhan,

Are you talking about java libs (jar) or native libs (.so, etc)?

*Jars:*

You can just jar it with your jar file, just put it in a lib subdirectory of
your jar root directory

*Native:

*Put them into $HADOOP_HOME/lib/native/$PLATFORM/ on each node in the
cluster

where PLATFORM is the string returned by `hadoop
org.apache.hadoop.util.PlatformName`

There is a way to distribute native libs runtime, but it's more involved.

Alex K
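For completeness, one hedged sketch of that runtime route for native libs:
ship the .so through the DistributedCache with a symlink and load it from the
task's working directory (the HDFS path and library name below are placeholders):

import java.io.File;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class NativeLibViaCache {

  // Call in the driver, on the job's Configuration, before submitting.
  public static void shipNativeLib(Configuration conf) throws Exception {
    // The "#libmystuff.so" fragment asks Hadoop to create a symlink with
    // that name in each task's working directory.
    DistributedCache.addCacheFile(new URI("/libs/libmystuff.so#libmystuff.so"), conf);
    DistributedCache.createSymlink(conf);
  }

  // Call inside the map or reduce task before the native code is first used.
  public static void loadNativeLib() {
    // Load by absolute path from the task's current working directory.
    System.load(new File("libmystuff.so").getAbsolutePath());
  }
}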

On Thu, Apr 22, 2010 at 4:04 PM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello Farhan,
>
>I use an external library and I run the MR job from command line. So
> I specify it in -libjars as follows
>
> hadoop jar (my jar) (my class) -libjars (external jar) (args for my class)
>
> Raghava.
>
> On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain  >wrote:
>
> > Hello guys,
> >
> > Can you please tell me how I can use external libraries which my jobs
> link
> > to in a MapReduce job? I added the following lines in mapred-site.xml in
> > all
> > my nodes and put the external library jars in the specified directory but
> I
> > am getting ClassNotFoundException:
> >
> > 
> >  mapred.child.java.opts
> >  -Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs
> > 
> >
> > Am I doing anything wrong? Is there any other way to solve my problem?
> >
> > Thanks,
> > Farhan
> >
>


Re: Using external library in MapReduce jobs

2010-04-22 Thread Raghava Mutharaju
Hello Farhan,

I use an external library and I run the MR job from command line. So
I specify it in -libjars as follows

hadoop jar (my jar) (my class) -libjars (external jar) (args for my class)

Raghava.
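One caveat worth adding (a sketch under the assumption that the driver is your
own code): -libjars is parsed by GenericOptionsParser, so the driver class has
to go through it, typically by implementing Tool and launching with ToolRunner,
roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already reflects the generic options such as -libjars and -D.
    Job job = new Job(getConf(), "my-job");
    // ... set mapper, reducer, input and output paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before run() sees the
    // remaining application arguments.
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}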

On Thu, Apr 22, 2010 at 6:21 PM, Farhan Husain wrote:

> Hello guys,
>
> Can you please tell me how I can use external libraries which my jobs link
> to in a MapReduce job? I added the following lines in mapred-site.xml in
> all
> my nodes and put the external library jars in the specified directory but I
> am getting ClassNotFoundException:
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs</value>
> </property>
>
> Am I doing anything wrong? Is there any other way to solve my problem?
>
> Thanks,
> Farhan
>


Using external library in MapReduce jobs

2010-04-22 Thread Farhan Husain
Hello guys,

Can you please tell me how I can use external libraries which my jobs link
to in a MapReduce job? I added the following lines in mapred-site.xml in all
my nodes and put the external library jars in the specified directory but I
am getting ClassNotFoundException:


<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Djava.library.path=/hadoop/Hadoop/userlibs</value>
</property>


Am I doing anything wrong? Is there any other way to solve my problem?

Thanks,
Farhan


Re: pomsets: workflow management for your cloud

2010-04-22 Thread Allen Wittenauer


The big one is license.  Azkaban is fully APL 2.0, pomsets is dual licensed 
(GPL for non-commercial, $$ for commercial).

 
On Apr 19, 2010, at 10:47 AM, Otis Gospodnetic wrote:

> Mike,
> 
> Would you happen to have a page with information about how this is similar or 
> different to something like Azkaban?
> 
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
> 
> 
> 
> - Original Message -
>> From: michael j pan
>> To: common-user@hadoop.apache.org
>> Sent: Sun, March 21, 2010 11:26:17 PM
>> Subject: pomsets: workflow management for your cloud
>> 
>> Apologies for the repost.  The previous message was sent from my
>> personal account and caused confusion for some people.
>> 
>> I'd like to invite the Hadoop community to check out the application
>> I've developed.  Its name is pomsets, and it is a workflow management
>> system for your cloud.  In short, it allows you to specify the jobs
>> you want to run, the dependencies between those jobs, and executes
>> them on your cloud (whether that cloud is a public or private cloud).
>> One of its features is its integration of Hadoop, so that you can run
>> your Hadoop jobs in your computational workflows alongside non-Hadoop
>> jobs.
>> 
>> http://pomsets.org
> 
> Thanks
> Mike



Re: Obtaining name of file in map task

2010-04-22 Thread Farhan Husain
The bug report referenced here says that it is not a bug at all. Hopefully the
documentation will be updated as mentioned in the comment.
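For anyone landing on this thread, a minimal sketch of the FileSplit approach
quoted below, written as a new-API mapper (class name and key/value types are
placeholders):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String inputFileName;

  protected void setup(Context context) throws IOException, InterruptedException {
    // Works when the split is a plain FileSplit (e.g. with TextInputFormat).
    FileSplit split = (FileSplit) context.getInputSplit();
    inputFileName = split.getPath().getName();
  }

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Tag every record with the name of the file it came from.
    context.write(new Text(inputFileName), value);
  }
}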

On Wed, Jan 20, 2010 at 2:19 PM, Farhan Husain wrote:

> You can try the following code:
>
> FileSplit fileSplit = (FileSplit) context.getInputSplit();
> String sFileName = fileSplit.getPath().getName();
>
>
> On Tue, Jan 12, 2010 at 10:04 AM, Raymond Jennings III <
> raymondj...@yahoo.com> wrote:
>
>> I am trying to determine what the name of the file that is being used for
>> the map task.  I am trying to use the setup() method to read the input file
>> with:
>>
>> public void setup(Context context) {
>>
>>Configuration conf = context.getConfiguration();
>>String inputfile = conf.get("map.input.file");
>> ..
>>
>> But inputfile is always null.  Anyone have a pointer on how to do this?
>>  Thanks.
>>
>>
>>
>>
>


Re: File permissions on S3FileSystem

2010-04-22 Thread Tom White
Hi Danny,

S3FileSystem has no concept of permissions, which is why this check
fails. The permissions check was introduced in
https://issues.apache.org/jira/browse/MAPREDUCE-181. Could you file
a bug for this please?

Cheers,
Tom

On Thu, Apr 22, 2010 at 4:16 AM, Danny Leshem  wrote:
> Hello,
>
> I'm running a Hadoop cluster using 3 small Amazon EC2 machines and the
> S3FileSystem.
> Till lately I've been using 0.20.2 and everything was ok.
>
> Now I'm using the latest trunc 0.22.0-SNAPSHOT and getting the following
> thrown:
>
> Exception in thread "main" java.io.IOException: The ownership/permissions on
> the staging directory
> s3://my-s3-bucket/mnt/hadoop.tmp.dir/mapred/staging/root/.staging is not as
> expected. It is owned by  and permissions are rwxrwxrwx. The directory must
> be owned by the submitter root or by root and permissions must be rwx--
>    at
> org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107)
>    at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:312)
>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:961)
>    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:977)
>    at com.mycompany.MyJob.runJob(MyJob.java:153)
>    at com.mycompany.MyJob.run(MyJob.java:177)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at com.mycompany.MyOtherJob.runJob(MyOtherJob.java:62)
>    at com.mycompany.MyOtherJob.run(MyOtherJob.java:112)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>    at com.mycompany.MyOtherJob.main(MyOtherJob.java:117)
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
>
> (The "it is owned by ... and permissions " is not a mistake, seems like the
> empty string is printed there)
>
> My configuration is as follows:
>
> core-site:
> fs.default.name=s3://my-s3-bucket
> fs.s3.awsAccessKeyId=[key id omitted]
> fs.s3.awsSecretAccessKey=[secret key omitted]
> hadoop.tmp.dir=/mnt/hadoop.tmp.dir
>
> hdfs-site: empty
>
> mapred-site:
> mapred.job.tracker=[domU-XX-XX-XX-XX-XX-XX.compute-1.internal:9001]
> mapred.map.tasks=6
> mapred.reduce.tasks=6
>
> Any help would be appreciated...
>
> Best,
> Danny
>


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
To some extent, for a 30GB file that is well balanced, the overhead from losing
data locality may not be too much. We will see. I will report my results
to this mailing list.

On Thu, Apr 22, 2010 at 2:44 PM, Allen Wittenauer
wrote:

>
> On Apr 22, 2010, at 11:46 AM, He Chen wrote:
>
> > Yes, but if you have more mappers, you may have more waves to execute. I
> > mean if I have 110 mappers for a job and I only have 22 cores. Then, it
> will
> > execute 5 waves approximately, If I have only 22 mappers, It will save
> the
> > overhead time.
>
> But you'll sacrifice data locality, which means that instead of testing the
> cpu, you'll be testing cpu+network.
>
>
>


-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread Allen Wittenauer

On Apr 22, 2010, at 11:46 AM, He Chen wrote:

> Yes, but if you have more mappers, you may have more waves to execute. I
> mean if I have 110 mappers for a job and I only have 22 cores. Then, it will
> execute 5 waves approximately, If I have only 22 mappers, It will save the
> overhead time.

But you'll sacrifice data locality, which means that instead of testing the 
cpu, you'll be testing cpu+network.




Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Yes, but if you have more mappers, you may have more waves to execute. I
mean, if I have 110 mappers for a job and only 22 cores, then it will execute
approximately 5 waves. If I have only 22 mappers, it will save that
overhead time.

2010/4/22 Edward Capriolo 

> 2010/4/22 He Chen 
>
> > Hi Raymond Jennings III
> >
> > I use 22 mappers because I have 22 cores in my clusters. Is this what you
> > want?
> >
> > On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III <
> > raymondj...@yahoo.com> wrote:
> >
> > > Isn't the number of mappers specified "only a suggestion" ?
> > >
> > > --- On Thu, 4/22/10, He Chen  wrote:
> > >
> > > > From: He Chen 
> > > > Subject: Hadoop does not follow my setting
> > > > To: common-user@hadoop.apache.org
> > > > Date: Thursday, April 22, 2010, 12:50 PM
> > >  > Hi everyone
> > > >
> > > > I am doing a benchmark by using Hadoop 0.20.0's wordcount
> > > > example. I have a
> > > > 30GB file. I plan to test differenct number of mappers'
> > > > performance. For
> > > > example, for a wordcount job, I plan to test 22 mappers, 44
> > > > mappers, 66
> > > > mappers and 110 mappers.
> > > >
> > > > However, I set the "mapred.map.tasks" equals to 22. But
> > > > when I ran the job,
> > > > it shows 436 mappers total.
> > > >
> > > > I think maybe the wordcount set its parameters inside the
> > > > its own program. I
> > > > give "-Dmapred.map.tasks=22" to this program. But it is
> > > > still 436 again in
> > > > my another try.  I found out that 30GB divide by 436
> > > > is just 64MB, it is
> > > > just my block size.
> > > >
> > > > Any suggestions will be appreciated.
> > > >
> > > > Thank you in advance!
> > > >
> > > > --
> > > > Best Wishes!
> > > > 顺送商祺!
> > > >
> > > > --
> > > > Chen He
> > > > (402)613-9298
> > > > PhD. student of CSE Dept.
> > > > Holland Computing Center
> > > > University of Nebraska-Lincoln
> > > > Lincoln NE 68588
> > > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Best Wishes!
> > 顺送商祺!
> >
> > --
> > Chen He
> > (402)613-9298
> > PhD. student of CSE Dept.
> > Holland Computing Center
> > University of Nebraska-Lincoln
> > Lincoln NE 68588
> >
>
> No matter how many total mappers exist for the job only a certain number of
> them run at once.
>



-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread Edward Capriolo
2010/4/22 He Chen 

> Hi Raymond Jennings III
>
> I use 22 mappers because I have 22 cores in my clusters. Is this what you
> want?
>
> On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III <
> raymondj...@yahoo.com> wrote:
>
> > Isn't the number of mappers specified "only a suggestion" ?
> >
> > --- On Thu, 4/22/10, He Chen  wrote:
> >
> > > From: He Chen 
> > > Subject: Hadoop does not follow my setting
> > > To: common-user@hadoop.apache.org
> > > Date: Thursday, April 22, 2010, 12:50 PM
> >  > Hi everyone
> > >
> > > I am doing a benchmark by using Hadoop 0.20.0's wordcount
> > > example. I have a
> > > 30GB file. I plan to test differenct number of mappers'
> > > performance. For
> > > example, for a wordcount job, I plan to test 22 mappers, 44
> > > mappers, 66
> > > mappers and 110 mappers.
> > >
> > > However, I set the "mapred.map.tasks" equals to 22. But
> > > when I ran the job,
> > > it shows 436 mappers total.
> > >
> > > I think maybe the wordcount set its parameters inside the
> > > its own program. I
> > > give "-Dmapred.map.tasks=22" to this program. But it is
> > > still 436 again in
> > > my another try.  I found out that 30GB divide by 436
> > > is just 64MB, it is
> > > just my block size.
> > >
> > > Any suggestions will be appreciated.
> > >
> > > Thank you in advance!
> > >
> > > --
> > > Best Wishes!
> > > 顺送商祺!
> > >
> > > --
> > > Chen He
> > > (402)613-9298
> > > PhD. student of CSE Dept.
> > > Holland Computing Center
> > > University of Nebraska-Lincoln
> > > Lincoln NE 68588
> > >
> >
> >
> >
> >
>
>
> --
> Best Wishes!
> 顺送商祺!
>
> --
> Chen He
> (402)613-9298
> PhD. student of CSE Dept.
> Holland Computing Center
> University of Nebraska-Lincoln
> Lincoln NE 68588
>

No matter how many total mappers exist for the job, only a certain number of
them run at once.


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hi Raymond Jennings III

I use 22 mappers because I have 22 cores in my clusters. Is this what you
want?

On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III <
raymondj...@yahoo.com> wrote:

> Isn't the number of mappers specified "only a suggestion" ?
>
> --- On Thu, 4/22/10, He Chen  wrote:
>
> > From: He Chen 
> > Subject: Hadoop does not follow my setting
> > To: common-user@hadoop.apache.org
> > Date: Thursday, April 22, 2010, 12:50 PM
>  > Hi everyone
> >
> > I am doing a benchmark by using Hadoop 0.20.0's wordcount
> > example. I have a
> > 30GB file. I plan to test differenct number of mappers'
> > performance. For
> > example, for a wordcount job, I plan to test 22 mappers, 44
> > mappers, 66
> > mappers and 110 mappers.
> >
> > However, I set the "mapred.map.tasks" equals to 22. But
> > when I ran the job,
> > it shows 436 mappers total.
> >
> > I think maybe the wordcount set its parameters inside the
> > its own program. I
> > give "-Dmapred.map.tasks=22" to this program. But it is
> > still 436 again in
> > my another try.  I found out that 30GB divide by 436
> > is just 64MB, it is
> > just my block size.
> >
> > Any suggestions will be appreciated.
> >
> > Thank you in advance!
> >
> > --
> > Best Wishes!
> > 顺送商祺!
> >
> > --
> > Chen He
> > (402)613-9298
> > PhD. student of CSE Dept.
> > Holland Computing Center
> > University of Nebraska-Lincoln
> > Lincoln NE 68588
> >
>
>
>
>


-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hey Eric Sammer

Thank you for the reply. Actually, I only care about the number of mappers
in my case. It looks like I should write the wordcount program with my own
InputFormat class.

2010/4/22 Eric Sammer 

> This is normal and expected. The mapred.map.tasks parameter is only a
> hint. The InputFormat gets to decide how to calculate splits.
> FileInputFormat and all subclasses, including TextInputFormat, use a
> few parameters to figure out what the appropriate split size will be
> but under most circumstances, this winds up being the block size. If
> you used fewer map tasks than blocks, you would sacrifice data
> locality which would only hurt performance.
>
> 2010/4/22 He Chen :
>  > Hi everyone
> >
> > I am doing a benchmark by using Hadoop 0.20.0's wordcount example. I have
> a
> > 30GB file. I plan to test differenct number of mappers' performance. For
> > example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66
> > mappers and 110 mappers.
> >
> > However, I set the "mapred.map.tasks" equals to 22. But when I ran the
> job,
> > it shows 436 mappers total.
> >
> > I think maybe the wordcount set its parameters inside the its own
> program. I
> > give "-Dmapred.map.tasks=22" to this program. But it is still 436 again
> in
> > my another try.  I found out that 30GB divide by 436 is just 64MB, it is
> > just my block size.
> >
> > Any suggestions will be appreciated.
> >
> > Thank you in advance!
> >
> > --
> > Best Wishes!
> > 顺送商祺!
> >
> > --
> > Chen He
> > (402)613-9298
> > PhD. student of CSE Dept.
> > Holland Computing Center
> > University of Nebraska-Lincoln
> > Lincoln NE 68588
> >
>
>
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>


Re: Hadoop does not follow my setting

2010-04-22 Thread Raymond Jennings III
Isn't the number of mappers specified "only a suggestion" ?

--- On Thu, 4/22/10, He Chen  wrote:

> From: He Chen 
> Subject: Hadoop does not follow my setting
> To: common-user@hadoop.apache.org
> Date: Thursday, April 22, 2010, 12:50 PM
> Hi everyone
> 
> I am doing a benchmark by using Hadoop 0.20.0's wordcount
> example. I have a
> 30GB file. I plan to test differenct number of mappers'
> performance. For
> example, for a wordcount job, I plan to test 22 mappers, 44
> mappers, 66
> mappers and 110 mappers.
> 
> However, I set the "mapred.map.tasks" equals to 22. But
> when I ran the job,
> it shows 436 mappers total.
> 
> I think maybe the wordcount set its parameters inside the
> its own program. I
> give "-Dmapred.map.tasks=22" to this program. But it is
> still 436 again in
> my another try.  I found out that 30GB divide by 436
> is just 64MB, it is
> just my block size.
> 
> Any suggestions will be appreciated.
> 
> Thank you in advance!
> 
> -- 
> Best Wishes!
> 顺送商祺!
> 
> --
> Chen He
> (402)613-9298
> PhD. student of CSE Dept.
> Holland Computing Center
> University of Nebraska-Lincoln
> Lincoln NE 68588
> 


   


Re: Hadoop does not follow my setting

2010-04-22 Thread Eric Sammer
This is normal and expected. The mapred.map.tasks parameter is only a
hint. The InputFormat gets to decide how to calculate splits.
FileInputFormat and all subclasses, including TextInputFormat, use a
few parameters to figure out what the appropriate split size will be
but under most circumstances, this winds up being the block size. If
you used fewer map tasks than blocks, you would sacrifice data
locality which would only hurt performance.
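If the goal is still to benchmark with roughly 22 map tasks on a ~30GB input,
the usual knob (a hedged sketch, at the cost of the data locality Eric
describes) is to raise the minimum split size so that each split spans several
blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FewerMapsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // FileInputFormat will not create splits smaller than this, so a value
    // well above the 64MB block size merges many blocks into one split.
    // Roughly 30GB / 22 maps is about 1.4GB per split (example value only).
    // Newer releases spell the property mapreduce.input.fileinputformat.split.minsize.
    conf.setLong("mapred.min.split.size", 1408L * 1024 * 1024);

    Job job = new Job(conf, "wordcount-with-fewer-maps");
    // ... the rest of the wordcount job setup as usual ...
  }
}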

2010/4/22 He Chen :
> Hi everyone
>
> I am doing a benchmark by using Hadoop 0.20.0's wordcount example. I have a
> 30GB file. I plan to test differenct number of mappers' performance. For
> example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66
> mappers and 110 mappers.
>
> However, I set the "mapred.map.tasks" equals to 22. But when I ran the job,
> it shows 436 mappers total.
>
> I think maybe the wordcount set its parameters inside the its own program. I
> give "-Dmapred.map.tasks=22" to this program. But it is still 436 again in
> my another try.  I found out that 30GB divide by 436 is just 64MB, it is
> just my block size.
>
> Any suggestions will be appreciated.
>
> Thank you in advance!
>
> --
> Best Wishes!
> 顺送商祺!
>
> --
> Chen He
> (402)613-9298
> PhD. student of CSE Dept.
> Holland Computing Center
> University of Nebraska-Lincoln
> Lincoln NE 68588
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com


Lucandra - Lucene/Solr on Cassandra: April 26, NYC

2010-04-22 Thread Otis Gospodnetic
Hello folks,

Those of you in or near NYC and using Lucene or Solr should come to "Lucandra - 
a Cassandra-based backend for Lucene and Solr" on April 26th:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/

The presenter will be Lucandra's author, Jake Luciani.

Please spread the word.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hi everyone

I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a
30GB file. I plan to test the performance of different numbers of mappers. For
example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66
mappers and 110 mappers.

However, I set "mapred.map.tasks" equal to 22, but when I ran the job,
it showed 436 mappers in total.

I think maybe wordcount sets its parameters inside its own program, so I
gave "-Dmapred.map.tasks=22" to this program, but it was still 436 in
my next try.  I found out that 30GB divided by 436 is just 64MB, which is
exactly my block size.

Any suggestions will be appreciated.

Thank you in advance!

-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


separate JVM flags for map and reduce tasks

2010-04-22 Thread Vasilis Liaskovitis
Hi,

I 'd like to pass different JVM options for map tasks and different
ones for reduce tasks. I think it should be straightforward to add
mapred.mapchild.java.opts, mapred.reducechild.java.opts to my
conf/mapred-site.xml and process the new options accordingly in
src/mapred/org/apache/mapreduce/TaskRunner.java . Let me know if you
think it's more involved than what I described.

My question is: if mapred.job.reuse.jvm.num.tasks is set to -1 (always
reuse), can the same JVM be re-used for different types of tasks? So
the same JVM being used e.g. first by a map task and then used by
reduce task. I am assuming this is definitely possible, though I
haven't verified in the code.
So , if one wants to pass different jvm options to map tasks and
reduce tasks, perhaps jobs.reuse.jvm.num.task should be set to 1
(never reuse) ?

thanks for your help,

- Vasilis


File permissions on S3FileSystem

2010-04-22 Thread Danny Leshem
Hello,

I'm running a Hadoop cluster using 3 small Amazon EC2 machines and the
S3FileSystem.
Till lately I've been using 0.20.2 and everything was ok.

Now I'm using the latest trunc 0.22.0-SNAPSHOT and getting the following
thrown:

Exception in thread "main" java.io.IOException: The ownership/permissions on
the staging directory
s3://my-s3-bucket/mnt/hadoop.tmp.dir/mapred/staging/root/.staging is not as
expected. It is owned by  and permissions are rwxrwxrwx. The directory must
be owned by the submitter root or by root and permissions must be rwx--
at
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107)
at
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:312)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:961)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:977)
at com.mycompany.MyJob.runJob(MyJob.java:153)
at com.mycompany.MyJob.run(MyJob.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.mycompany.MyOtherJob.runJob(MyOtherJob.java:62)
at com.mycompany.MyOtherJob.run(MyOtherJob.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.mycompany.MyOtherJob.main(MyOtherJob.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

(The "it is owned by ... and permissions " is not a mistake, seems like the
empty string is printed there)

My configuration is as follows:

core-site:
fs.default.name=s3://my-s3-bucket
fs.s3.awsAccessKeyId=[key id omitted]
fs.s3.awsSecretAccessKey=[secret key omitted]
hadoop.tmp.dir=/mnt/hadoop.tmp.dir

hdfs-site: empty

mapred-site:
mapred.job.tracker=[domU-XX-XX-XX-XX-XX-XX.compute-1.internal:9001]
mapred.map.tasks=6
mapred.reduce.tasks=6

Any help would be appreciated...

Best,
Danny


Re: Hadoop eclipse java.io.EOFexception

2010-04-22 Thread Shevek
On Wed, 2010-04-21 at 00:22 -0700, rahulBhatia wrote:
> Hello!
> 
> I'm a newbie with Hadoop and I'm just learning to set it up for a class
> project. I followed all the steps in the yahoo tutorial and I'm running
> hadoop on a virtual machine in VMplayer. I'm running Windows 7 on the host
> machine. I'm stuck at trying to access the Hadoop HDFS from Eclipse! I've
> done everything possible and I've tried everything recommended by people in
> other threads.

Hi,

You may find the answer to some of your troubles in an alternative
plugin, which you can get from http://www.hadoopstudio.org/

It's somewhat richer in features, and I know of classes that have run
successfully using it. It also works natively on Windows, without
Cygwin or any other tricks.

Good luck.

S.

-- 
http://www.hadoopstudio.org/
Karmasphere Studio for Hadoop - An intuitive visual interface to Big Data



Re: Hadoop performance - xfs and ext4

2010-04-22 Thread Steve Loughran

stephen mulcahy wrote:

Hi,

I've been tweaking our cluster roll-out process to refine it. While 
doing so, I decided to check if XFS gives any performance benefit over 
EXT4.


As per a comment I read somewhere on the hbase wiki - XFS makes for 
faster formatting of filesystems (it takes us 5.5 minutes to rebuild a 
datanode from bare metal to a full Hadoop config on top of Debian 
Squeeze using XFS) versus EXT4 (same bare metal restore takes 9 minutes).


However, TeraSort performance on a cluster of 45 of these data-nodes 
shows XFS is slower (same configuration settings on both installs other 
than changed filesystem), specifically,


mkfs.xfs -f -l size=64m DEV
(mounted with noatime,nodiratime,logbufs=8)
gives me a cluster which runs TeraSort in about 23 minutes

mkfs.ext4 -T largefile4 DEV
(mounted with noatime)
gives me a cluster which runs TeraSort in about 18.5 minutes

So I'll be rolling our cluster back to EXT4, but thought the information 
might be useful/interesting to others.


-stephen


XFS config chosen from notes at 
http://everything2.com/index.pl?node_id=1479435




That's really interesting. Do you want to update the bits of the Hadoop 
wiki that talk about filesystems?





Re: Hadoop Index Contrib

2010-04-22 Thread Renaud Delbru

Hi Otis,

issue has been opened [1], and first patch submitted.

[1] https://issues.apache.org/jira/browse/MAPREDUCE-1722
--
Renaud Delbru

On 21/04/10 19:16, Otis Gospodnetic wrote:

Hi Renaud,

I think you should just open a new one, unless one specifically for that already 
exists (I'm guessing it doesn't).
You can use http://www.search-hadoop.com/ to check for that easily. For example: 
http://www.search-hadoop.com/?q=contrib+index+&fc_type=jira



Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
From: Renaud Delbru
To: common-user@hadoop.apache.org
Sent: Wed, April 21, 2010 9:57:37 AM
Subject: Re: Hadoop Index Contrib

Hi Otis,

can you point me to an issue in JIRA where I can post the patch, or do I
open a new issue?

Cheers




Re: Hadoop performance - xfs and ext4

2010-04-22 Thread Andrew Klochkov
Hi,

Just curious - did you try ext3? Could it be faster than ext4? The Hadoop
wiki suggests ext3, as it's the filesystem most commonly used for Hadoop clusters:

http://wiki.apache.org/hadoop/DiskSetup

On Thu, Apr 22, 2010 at 12:02 PM, stephen mulcahy
wrote:

> Hi,
>
> I've been tweaking our cluster roll-out process to refine it. While doing
> so, I decided to check if XFS gives any performance benefit over EXT4.
>
> As per a comment I read somewhere on the hbase wiki - XFS makes for faster
> formatting of filesystems (it takes us 5.5 minutes to rebuild a datanode
> from bare metal to a full Hadoop config on top of Debian Squeeze using XFS)
> versus EXT4 (same bare metal restore takes 9 minutes).
>
> However, TeraSort performance on a cluster of 45 of these data-nodes shows
> XFS is slower (same configuration settings on both installs other than
> changed filesystem), specifically,
>
> mkfs.xfs -f -l size=64m DEV
> (mounted with noatime,nodiratime,logbufs=8)
> gives me a cluster which runs TeraSort in about 23 minutes
>
> mkfs.ext4 -T largefile4 DEV
> (mounted with noatime)
> gives me a cluster which runs TeraSort in about 18.5 minutes
>
> So I'll be rolling our cluster back to EXT4, but thought the information
> might be useful/interesting to others.
>
> -stephen
>
>
> XFS config chosen from notes at
> http://everything2.com/index.pl?node_id=1479435
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie http://webstar.deri.ie http://sindice.com
>



-- 
Andrew Klochkov


Try to mount HDFS

2010-04-22 Thread Christian Baun
Dear All,

I want to test HDFS inside Amazon EC2.

Two Ubuntu instances are running inside EC2. 
One server is namenode and jobtracker. The other server is the datanode.
Cloudera (hadoop-0.20) is installed and running.

Now, I want to mount HDFS.
I tried to install contrib/fuse-dfs as described here:
http://wiki.apache.org/hadoop/MountableHDFS

The compilation worked via:

# ant compile-c++-libhdfs -Dlibhdfs=1
# ant package -Djava5.home=/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/ 
-Dforrest.home=/home/ubuntu/apache-forrest-0.8/
# ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1

But now, when I try to mount the filesystem:

# ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 
/mnt/hdfs/ -d
port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
fuse-dfs didn't recognize /mnt/hdfs/,-2
fuse-dfs ignoring option -d
FUSE library version: 2.8.1
nullpath_ok: 0
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
INIT: 7.13
flags=0x007b
max_readahead=0x0002
   INIT: 7.12
   flags=0x0011
   max_readahead=0x0002
   max_write=0x0002
   unique: 1, success, outsize: 40


# ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 
/mnt/hdfs/
port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
fuse-dfs didn't recognize /mnt/hdfs/,-2

# ls /mnt/hdfs/
ls: reading directory /mnt/hdfs/: Input/output error
# ls /mnt/hdfs/
ls: cannot access /mnt/hdfs/o¢: No such file or directory
o???
# ls /mnt/hdfs/
ls: reading directory /mnt/hdfs/: Input/output error
# ls /mnt/hdfs/
ls: cannot access /mnt/hdfs/`á›Óÿ: No such file or directory
`?
# ls /mnt/hdfs/
ls: reading directory /mnt/hdfs/: Input/output error
...


What can I do at this point?

Thanks in advance
    Christian


Re: Trouble copying local file to hdfs

2010-04-22 Thread manas.tomar


On Wed, 21 Apr 2010 15:01:17 +0530 Steve Loughran wrote:

>manas.tomar wrote: 
>> I have set-up Hadoop on OpenSuse 11.2 VM using Virtualbox. I ran Hadoop 
>> examples in the standalone mode successfully. 
>> Now, I want to run in distributed mode using 2 nodes. 
>> Hadoop starts fine and jps lists all the nodes. But when i try to put any 
>> file or run any example, I get error. For e.g. : 
>> 
>> had...@master:~/hadoop> ./bin/hadoop dfs -copyFromLocal ./input inputsample 
>> 10/04/17 14:42:46 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
>> java.net.SocketException: Operation not supported 
>> 10/04/17 14:42:46 INFO hdfs.DFSClient: Abandoning block 
>> blk_8951413748418693186_1080 
>>  
>> 10/04/17 14:43:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
>> java.net.SocketException: Protocol not available 
>> 10/04/17 14:43:04 INFO hdfs.DFSClient: Abandoning block 
>> blk_838428157309440632_1081 
>> 10/04/17 14:43:10 WARN hdfs.DFSClient: DataStreamer Exception: 
>> java.io.IOException: Unable to create new block. 
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
>>  
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>>  
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>>  
>> 
>> 10/04/17 14:43:10 WARN hdfs.DFSClient: Error Recovery for block 
>> blk_838428157309440632_1081 bad datanode[0] nodes == null 
>> 10/04/17 14:43:10 WARN hdfs.DFSClient: Could not get block locations. Source 
>> file "/user/hadoop/inputsample/check" - Aborting... 
>> copyFromLocal: Protocol not available 
>> 10/04/17 14:43:10 ERROR hdfs.DFSClient: Exception closing file 
>> /user/hadoop/inputsample/check : java.net.SocketException: Protocol not 
>> available 
>> java.net.SocketException: Protocol not available 
>> at sun.nio.ch.Net.getIntOption0(Native Method) 
>> at sun.nio.ch.Net.getIntOption(Net.java:178) 
>> at sun.nio.ch.SocketChannelImpl$1.getInt(SocketChannelImpl.java:419) 
>> at sun.nio.ch.SocketOptsImpl.getInt(SocketOptsImpl.java:60) 
>> at sun.nio.ch.SocketOptsImpl.sendBufferSize(SocketOptsImpl.java:156) 
>> at sun.nio.ch.SocketOptsImpl$IP$TCP.sendBufferSize(SocketOptsImpl.java:286) 
>> at sun.nio.ch.OptionAdaptor.getSendBufferSize(OptionAdaptor.java:129) 
>> at sun.nio.ch.SocketAdaptor.getSendBufferSize(SocketAdaptor.java:328) 
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2873)
>>  
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>>  
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>>  
>> at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>>  
>> 
>> 
>> I can see the files on HDFS through the web interface but they are empty. 
>> Any suggestion on how can I get over this ? 
>> 
> 
>That is a very low-level socket error; I would file a bugrep on hadoop 
>and include all machine details, as there is something very odd about 
>your underlying machine or network stack that is stopping hadoop 
>tweaking TCP buffer sizes
>

Thanks.
Any suggestions on how to narrow down the cause?
I want to know whether it is Hadoop or my network config,
i.e. any of OpenSuse, VirtualBox, or Vista, before I file a bug report.
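
To help narrow it down, here is a small probe I could run outside Hadoop
(my own sketch; host and port are placeholders, e.g. a datanode on 50010)
that goes through the same java.nio getSendBufferSize() path as the trace
above. If this also throws "Protocol not available" on the OpenSuse guest,
the problem would seem to be the VM's network stack rather than Hadoop:

import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

public class SendBufferProbe {
  public static void main(String[] args) throws Exception {
    // Placeholders: point this at any reachable TCP service.
    String host = args.length > 0 ? args[0] : "localhost";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 22;
    SocketChannel channel = SocketChannel.open();
    channel.connect(new InetSocketAddress(host, port));
    // Same getter that DFSClient hits via sun.nio.ch in the stack trace above.
    System.out.println("sendBufferSize=" + channel.socket().getSendBufferSize());
    channel.close();
  }
}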



Hadoop performance - xfs and ext4

2010-04-22 Thread stephen mulcahy

Hi,

I've been tweaking our cluster roll-out process to refine it. While 
doing so, I decided to check if XFS gives any performance benefit over EXT4.


As per a comment I read somewhere on the hbase wiki - XFS makes for 
faster formatting of filesystems (it takes us 5.5 minutes to rebuild a 
datanode from bare metal to a full Hadoop config on top of Debian 
Squeeze using XFS) versus EXT4 (same bare metal restore takes 9 minutes).


However, TeraSort performance on a cluster of 45 of these data-nodes 
shows XFS is slower (same configuration settings on both installs other 
than changed filesystem), specifically,


mkfs.xfs -f -l size=64m DEV
(mounted with noatime,nodiratime,logbufs=8)
gives me a cluster which runs TeraSort in about 23 minutes

mkfs.ext4 -T largefile4 DEV
(mounted with noatime)
gives me a cluster which runs TeraSort in about 18.5 minutes

So I'll be rolling our cluster back to EXT4, but thought the information 
might be useful/interesting to others.


-stephen


XFS config chosen from notes at 
http://everything2.com/index.pl?node_id=1479435


--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com