Re: can't find java home

2009-08-26 Thread vikas
Can you run dos2unix /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/*.sh and
then try again?

Thanks,
-Vikas.

On Wed, Aug 26, 2009 at 11:56 AM, Puri, Aseem aseem.p...@honeywell.com wrote:

 Hi



 I am facing an issue while starting my hadoop cluster.



 When I run the command:



 $ bin/hadoop namenode -format



 I get the following errors:



 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 2:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 7:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 10:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 13:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 16:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 19:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 29:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 32:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 35:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 38:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 41:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 46:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 49:
 $'\r': command not found

 /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 52:
 $'\r': command not found

 /bin/java: No such file or directorymin/java/jdk1.6.0_13

 /bin/java: No such file or directorymin/java/jdk1.6.0_13

 /bin/java: cannot execute: No such file or directorydk1.6.0_13



 Also, in /home/HadoopAdmin/hadoop-0.18.3/conf/hadoop-env.sh I set my
 Java home to the directory where I installed Java:



 export JAVA_HOME=/home/HadoopAdmin/java/jdk1.6.0_13



 Please help me on this issue.



 Regards

 Aseem Puri








HBase master not starting

2009-08-26 Thread ilayaraja
Hello,

I am trying to set up HBase 0.20 with Hadoop 0.20 in fully distributed mode.
I have a problem while starting the HBase master. The stack trace is as follows:



2009-08-26 01:18:31,454 INFO org.apache.hadoop.hbase.master.HMaster: My address 
is domU-12-31-39-00-0A-52.compute-1.internal:6
2009-08-26 01:18:32,600 FATAL org.apache.hadoop.hbase.master.HMaster: Not starting HMaster because:
java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
        at java.io.DataInputStream.readUTF(DataInputStream.java:572)

Please help me out with this. Below is my hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://domU-12-31-39-00-28-52.compute-1.internal:40010/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>

  <property>
    <name>hbase.master</name>
    <value>domU-12-31-39-00-28-52.compute-1.internal:6</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>

  <property>
    <name>hbase.master.port</name>
    <value>6</value>
    <description>The port master should bind to.</description>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>




Testing Hadoop job

2009-08-26 Thread Nikhil Sawant

Hi,

Can you guys suggest a Hadoop unit testing framework apart from MRUnit?
I have used MRUnit, but I am not sure about its feasibility and support for
Hadoop 0.20.
I could not find proper documentation for MRUnit; is it available
anywhere?


--
cheers
nikhil
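Even without MRUnit, a mapper or reducer written against the old
org.apache.hadoop.mapred API can be unit tested with plain JUnit (or even a
main method) by driving it directly with an in-memory OutputCollector and
Reporter.NULL. A minimal, self-contained sketch follows; the TokenMapper
below is a toy mapper invented purely for illustration and is not part of
any project mentioned in this thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimpleMapperTest {

  // A toy mapper used only to show the testing pattern: emits (word, 1) per token.
  static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        out.collect(new Text(tok.nextToken()), ONE);
      }
    }
  }

  // Collects (key, value) pairs in memory so the test can inspect them.
  static class InMemoryCollector<K, V> implements OutputCollector<K, V> {
    final List<K> keys = new ArrayList<K>();
    final List<V> values = new ArrayList<V>();

    public void collect(K key, V value) {
      keys.add(key);
      values.add(value);
    }
  }

  public static void main(String[] args) throws IOException {
    TokenMapper mapper = new TokenMapper();
    InMemoryCollector<Text, IntWritable> out = new InMemoryCollector<Text, IntWritable>();

    // Drive the mapper directly; Reporter.NULL is the no-op reporter from the old API.
    mapper.map(new LongWritable(0), new Text("hadoop hadoop"), out, Reporter.NULL);

    // Inspect the collected output (use JUnit asserts in a real test).
    if (out.keys.size() != 2 || !out.keys.get(0).equals(new Text("hadoop"))) {
      throw new AssertionError("unexpected mapper output: " + out.keys);
    }
    System.out.println("OK: mapper emitted " + out.keys.size() + " records");
  }
}

The same pattern works for reducers: build an Iterator over the values you
want to feed in and assert on what the reducer collects.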



0.19.1 infinite loop

2009-08-26 Thread Jeremy Pinkham
I'm using hadoop 0.19.1 on a 60 node cluster, each node has 8GB of ram
and 4 cores.  I have several jobs that run every day, and last night one
of them triggered an infinite loop that rendered the cluster inoperable.
As the job finishes, the following is logged to the job tracker logs:

 

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'attempt_200908220740_0126_r_01_0' has completed
task_200908220740_0126_r_01 successfully.

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job
job_200908220740_0126 has completed successfully.

2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

 

That last line ("Could not complete file ... retrying...") then repeats forever, at
which point the job tracker UI stops responding and no more tasks will
run.  The only way to free things up is to restart the jobtracker.

 

Both prior to and during the infinite loop, I see this in the namenode
logs.  Because it starts long before the infinite loop, I can't tell for
sure whether it's related, and it is still happening now, even after the
restart, with jobs finishing without issue.

 

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.

java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

 

And finally, this warning appears in the namenode logs just prior as
well

 

2009-08-25 22:07:22,580 WARN
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size
for block blk_-1458477261945758787_4416123 reported from
172.21.30.4:50010 current size is 5396992 reported size is 67108864

 

Can anyone point me in a direction to determine what's going on here?

 

Thanks






Re: 0.19.1 infinite loop

2009-08-26 Thread Brian Bockelman

Hey Jeremy,

Glad someone else has run into this!

I always thought this specific infinite loop was in my code.  I had an  
issue open for it earlier, but I ultimately was not sure if it was in  
my code or HDFS, so we closed it:


https://issues.apache.org/jira/browse/HADOOP-4866

We [and others] get these daily.  It would be nice to figure out a way  
to replicate this.


Brian



RE: 0.19.1 infinite loop

2009-08-26 Thread Jeremy Pinkham
Thanks Brian.  I'm trying to find a way to reliably replicate it and will
certainly update this list if I manage to do so.  It is happening more
frequently in our QA environment, which is a much smaller cluster (only
2 nodes), but still not deterministically.  Hopefully we can home in on
something.





Re: Intra-datanode balancing?

2009-08-26 Thread Kris Jirapinyo
But I mean, then how does that datanode know that these files were copied
from one partition to another, into this new directory?  I'm not sure of the
inner workings of how a datanode knows which files it holds... I was
assuming that it keeps track of the subdir directory... or is that
just a placeholder name, and whatever directory is under that parent
directory will be scanned and picked up by the datanode?

Kris.

On Tue, Aug 25, 2009 at 6:24 PM, Raghu Angadi rang...@yahoo-inc.com wrote:

 Kris Jirapinyo wrote:

 How does copying the subdir work?  What if that partition already has the
 same subdir (in the case that our partition is not new but relatively
 new...with maybe 10% used)?


 You can copy the files. There isn't really any requirement on the number of
 files in a directory. Something like cp -r subdir5 dest/subdir5 might do (or
 rsync without the --delete option). Just make sure you delete the directory
 from the source.

 Raghu.


  Thanks for the suggestions so far guys.

 Kris.

 On Tue, Aug 25, 2009 at 5:01 PM, Raghu Angadi rang...@yahoo-inc.com
 wrote:

 For now you are stuck with the hack. Sooner or later Hadoop has to handle
 heterogeneous nodes better.

 In general it tries to write to all the disks irrespective of % full, since
 that gives the best performance (assuming each partition's capabilities are
 the same). But it is lame at handling skews.

 Regarding your hack:
  1. You can copy a subdir to the new partition rather than deleting it
     (datanodes should be shut down).

  2. I would think it is less work to implement a better policy in the
 DataNode for this case. It would be a pretty local change. When choosing a
 partition for a new block, the DN already knows how much free space is left
 on each one. For the simplest implementation, you skip partitions that have
 less than 25% of the average free space, or choose with a probability
 proportional to relative free space. If it works well, file a jira.

 I don't think HDFS-343 is directly related to this or is likely to be
 fixed. There is another jira that makes placement policy at the NameNode
 pluggable (it does not affect the Datanode).

 Raghu.


 Kris Jirapinyo wrote:

   Hi all,
   I know this has been filed as a JIRA improvement already
  http://issues.apache.org/jira/browse/HDFS-343, but is there any good
  workaround at the moment?  What's happening is I have added a few new EBS
  volumes to half of the cluster, but Hadoop doesn't want to write to them.
  When I try to do cluster rebalancing, since the new disks make the
  percentage used lower, it fills up the first two existing local disks,
  which is exactly what I don't want to happen.  Currently, I just delete
  several subdirs from dfs, since I know that with a replication factor of 3,
  it'll be ok, so that fixes the problems in the short term.  But I still
  cannot get Hadoop to use those new larger disks efficiently.  Any thoughts?

 -- Kris.
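The second suggestion quoted above -- choosing a partition for a new block
with probability proportional to its free space -- could look roughly like
the standalone sketch below. This is not actual DataNode code; the free-space
figures in main() are made up purely for illustration.

import java.util.Random;

public class WeightedVolumeChooser {
  private final Random random = new Random();

  // Returns the index of the chosen volume, weighted by free bytes.
  public int choose(long[] freeBytes) {
    long total = 0;
    for (long f : freeBytes) {
      total += Math.max(0, f);
    }
    if (total <= 0) {
      throw new IllegalStateException("no free space on any volume");
    }
    // Pick a random point in [0, total) and find the volume it falls into.
    long point = (long) (random.nextDouble() * total);
    long running = 0;
    for (int i = 0; i < freeBytes.length; i++) {
      running += Math.max(0, freeBytes[i]);
      if (point < running) {
        return i;
      }
    }
    return freeBytes.length - 1; // defensive fallback against rounding
  }

  public static void main(String[] args) {
    // Two nearly full local disks and one new, mostly empty volume (made-up sizes).
    long[] free = {5L << 30, 8L << 30, 400L << 30};
    WeightedVolumeChooser chooser = new WeightedVolumeChooser();
    int[] counts = new int[free.length];
    for (int i = 0; i < 10000; i++) {
      counts[chooser.choose(free)]++;
    }
    for (int i = 0; i < counts.length; i++) {
      System.out.println("volume " + i + " chosen " + counts[i] + " times");
    }
  }
}

With those made-up numbers, the new large volume receives the vast majority
of new blocks, which is the behavior this thread is after.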







Re: Intra-datanode balancing?

2009-08-26 Thread Raghu Angadi

Kris Jirapinyo wrote:

But I mean, then how does that datanode know that these files were copied
from one partition to another, into this new directory?  I'm not sure of the
inner workings of how a datanode knows which files it holds... I was
assuming that it keeps track of the subdir directory...




or is that
just a placeholder name, and whatever directory is under that parent
directory will be scanned and picked up by the datanode?


Correct. The directory name does not matter. The only requirement is that a
block file and its .meta file are in the same directory. When the datanode
starts up, it scans all these directories and stores their paths in memory.


Of course, this is still a big hack! (just making it clear for readers 
who haven't seen the full context).


Raghu.










Re: Intra-datanode balancing?

2009-08-26 Thread Kris Jirapinyo
Hmm, in that case it is possible for me to manually balance those datanodes
by moving most of the files onto the new, larger partition.  I will try it.
Thanks!

-- Kris J.









Re: Seattle / NW Hadoop, HBase Lucene, etc. Meetup , Wed August 26th, 6:45pm

2009-08-26 Thread Bradford Stephens

Hello,

My apologies, but there was a mix-up reserving our meeting location,  
and we don't have access to it.


I'm very sorry, and beer is on me next month. Promise :)

Sent from my Internets

On Aug 25, 2009, at 4:21 PM, Bradford Stephens bradfordsteph...@gmail.com 
 wrote:



Hey there,

Apologies for this not going out sooner -- apparently it was sitting
as a draft in my inbox. A few of you have pinged me, so thanks for
your vigilance.

It's time for another Hadoop/Lucene/Apache Stack meetup! We've had
great attendance in the past few months, let's keep it up! I'm always
amazed by the things I learn from everyone.

We're back at the University of Washington, Allen Computer Science
Center (not Computer Engineering)
Map: http://www.washington.edu/home/maps/?CSE

Room: 303 -or- the Entry level. If there are changes, signs will be  
posted.


More Info:

The meetup is about 2 hours: we'll have two in-depth talks of 15-20
minutes each, and then several lightning talks of 5 minutes. If no
one offers, we'll just have general discussion and 'social time'.
Let me know if you're interested in speaking
or attending. We'd like to focus on education, so every presentation
*needs* to ask some questions at the end. We can talk about these
after the presentations, and I'll record what we've learned in a wiki
and share that with the rest of us.

Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Symlink support

2009-08-26 Thread Yasuyuki Watanabe
Hi!

Could someone tell me about the status of symbolic link support
in HDFS (HDFS-245)?

It looks like a patch has been merged into the latest trunk. So I would
like to know how well it works and whether or not the patch is
applicable to the current release of Hadoop.

We just started testing HDFS as a part of our ftp mirror site
(http://ftp.kddilabs.jp/). It works fine so far. But if HDFS
supported symlinks, we could save a lot of capacity :-)

Thanks.

Yasu



control map to split assignment

2009-08-26 Thread Rares Vernica
Hello,

I wonder if there is a way to control how maps are assigned to splits
in order to balance the load across the cluster.

Here is a simplified example. I have two types of inputs: long and
short. Each input is in a different file and will be processed by a
single map task. Suppose the long inputs take 10s to process while
the short inputs take 3s to process. I have two long inputs and
two short inputs. My cluster has 2 nodes and each node can execute
only one map task at a time. A possible schedule of the tasks could be
the following:

Node 1: long map, short map - 10s + 3s = 13s
Node 2: long map, short map - 10s + 3s = 13s

So, my job will be done in 13s. Another possible schedule is:

Node 1: long map - 10s
Node 2: short map, short map, long map - 3s + 3s + 10s = 16s

And, my job will be done in 16s. Clearly, the first scheduling is better.

Is there a way to control how the schedule is built? If I can control
which inputs are processed first, I could schedule the long inputs
to be processed first, so that they are balanced across nodes and I
end up with something similar to the first schedule.
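One thing worth experimenting with -- assuming your Hadoop version dispatches
map tasks roughly in the order that InputFormat.getSplits() returns them,
which you would need to verify for your release -- is a custom input format
that returns the large splits first. A rough sketch against the old mapred
API (the class name is just an example):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: return splits ordered largest-first so long inputs are handed out
// before short ones. Whether task dispatch actually follows this order depends
// on the scheduler in your Hadoop version, so treat this as an experiment.
public class LargestFirstTextInputFormat extends TextInputFormat {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] splits = super.getSplits(job, numSplits);
    Arrays.sort(splits, new Comparator<InputSplit>() {
      public int compare(InputSplit a, InputSplit b) {
        try {
          // Descending by split length: bigger (longer-running) splits first.
          return Long.valueOf(b.getLength()).compareTo(Long.valueOf(a.getLength()));
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    });
    return splits;
  }
}

You would then register it on the job with
conf.setInputFormat(LargestFirstTextInputFormat.class).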

I could configure the job so that a long input gets processed by
more than one map, and so end up balancing the work, but I noticed that,
overall, this takes more time than a bad schedule with only one map
per input.

Thanks!

Cheers,
Rares Vernica


How does reducer get intermediate output?

2009-08-26 Thread inifok.song
Hi all,

In my cluster, the reducers often can't fetch the mappers' output. I know
there are many possible reasons for this, and I think it's necessary to
understand how the reducer gets the intermediate output. I have read the
source code; however, I'm not clear about the whole process. Could you
explain it? How do the nodes communicate with each other, and how does the
ReduceCopier class work?

Thank you.

Inifok


Re: Concatenating files on HDFS

2009-08-26 Thread Ankur Goel
HDFS files are write-once, so you cannot append to them (at the moment).
What you can do is copy your local file into the HDFS directory containing the
file you want to append to.
Once that is done, you can run a simple (IdentityMapper & IdentityReducer)
MapReduce job with that directory as the input and the number of reducers set
to 1.
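For reference, a minimal driver for that identity pass might look like the
sketch below. The paths are placeholders. Two caveats to check before relying
on the output: with TextInputFormat the byte-offset keys are carried through
the identity mapper and reducer, so output lines come out as "offset<TAB>line",
and the single reduce sorts by those offsets, so line order across the two
input files is not preserved unless you substitute a mapper/key that handles
ordering.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ConcatJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ConcatJob.class);
    conf.setJobName("concat-via-identity-mr");

    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(1);             // single reducer => single output file

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Directory that already holds the HDFS file plus the newly copied local file.
    FileInputFormat.setInputPaths(conf, new Path("/user/turner/concat-input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/turner/concat-output"));

    JobClient.runJob(conf);
  }
}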

- Original Message -
From: Turner Kunkel thkun...@gmail.com
To: core-u...@hadoop.apache.org
Sent: Wednesday, August 26, 2009 10:02:41 PM GMT +05:30 Chennai, Kolkata, 
Mumbai, New Delhi
Subject: Concatenating files on HDFS

Is there any way to concatenate/append a local file to a file on HDFS
without copying down the HDFS file locally first?

I tried:
bin/hadoop dfs -cat file:///[local file] >> hdfs://[hdfs file]
But it just tries to look for hdfs://[hdfs file] as a local file,
since I suppose the dfs -cat command doesn't support the >> operator.

Thanks.
-- 

-Turner Kunkel