Re: hadoop streaming binary input / image processing

2009-05-14 Thread Piotr Praczyk
It depends on which API you use. When writing an InputSplit implementation, it
is possible to specify on which nodes the data resides. I am new to
Hadoop, but as far as I know, doing this
should enable support for data locality. Moreover, implementing a
subclass of TextInputFormat that adds some encoding on the fly should not
impact any locality properties.
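
For example, here is a rough, untested sketch of such an InputFormat against the
0.18/0.19 org.apache.hadoop.mapred API: it reads each image file whole and hands
it to the mapper as a single Base64-encoded record. The class name is made up,
it assumes commons-codec is on the classpath, and it is only meant to illustrate
the idea:

import java.io.IOException;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Emits one record per file: key = file path, value = Base64 of its bytes. */
public class Base64WholeFileInputFormat extends FileInputFormat<Text, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // one file per split, so the split's locations are the file's blocks
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split,
      final JobConf job, Reporter reporter) throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    return new RecordReader<Text, Text>() {
      private boolean done = false;

      public boolean next(Text key, Text value) throws IOException {
        if (done) return false;
        Path path = fileSplit.getPath();
        FileSystem fs = path.getFileSystem(job);
        // Assumes each image fits comfortably in memory.
        byte[] bytes = new byte[(int) fileSplit.getLength()];
        FSDataInputStream in = fs.open(path);
        try {
          IOUtils.readFully(in, bytes, 0, bytes.length);
        } finally {
          in.close();
        }
        key.set(path.toString());
        value.set(new String(Base64.encodeBase64(bytes)));  // encode on the fly
        done = true;
        return true;
      }

      public Text createKey() { return new Text(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return done ? fileSplit.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    };
  }
}

Because isSplitable() returns false, each image stays in one split whose
preferred hosts are the ones holding its blocks, so locality should be
preserved.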


Piotr


2009/5/15 jason hadoop 

> A downside of this approach is that you will likely not have data locality
> for data on shared file systems, compared with data coming from an input
> split.
> That being said, from your script, *hadoop dfs -get FILE -* will write the
> file to standard out.
>
> On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk  >wrote:
>
> > Just in addition to my previous post...
> >
> > You don't have to store the encoded files in a file system, of course,
> > since you can write your own InputFormat which will do this on the fly;
> > the overhead should not be that big.
> >
> > Piotr
> >
> > 2009/5/14 Piotr Praczyk 
> >
> > > Hi
> > >
> > > If you want to read the files from HDFS and cannot pass the binary
> > > data, you can encode it (Base64, for example, though you can think
> > > about something more efficient, since the range of characters
> > > acceptable in the input string is wider than the Base64 alphabet). It
> > > should solve the problem until some kind of binary input is supported
> > > (is that going to happen?).
> > >
> > > Piotr
> > >
> > > 2009/5/14 openresearch 
> > >
> > >
> > >> All,
> > >>
> > >> I have read some recommendations regarding image (binary input)
> > >> processing using Hadoop streaming, which only accepts text out of the
> > >> box for now.
> > >> http://hadoop.apache.org/core/docs/current/streaming.html
> > >> https://issues.apache.org/jira/browse/HADOOP-1722
> > >> http://markmail.org/message/24woaqie2a6mrboc
> > >>
> > >> However, I have not gotten a straight answer.
> > >>
> > >> One recommendation is to put image data on HDFS, but then we have to do
> > >> "hadoop dfs -get" for each file/dir and process it locally, which is
> > >> very expensive.
> > >>
> > >> Another recommendation is to "...put them in a centralized place where
> > >> all the hadoop nodes can access them (via e.g. an NFS mount)..."
> > >> Obviously, IO will become the bottleneck and it defeats the purpose of
> > >> distributed processing.
> > >>
> > >> I also notice an enhancement ticket is open for hadoop-core. Is it
> > >> committed to any svn (0.21) branch? Can somebody show me an example of
> > >> how to take *.jpg files (from HDFS) and process them in a distributed
> > >> fashion using streaming?
> > >>
> > >> Many thanks
> > >>
> > >> -Qiming
> > >> --
> > >> View this message in context:
> > >>
> >
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> > >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>
> > >>
> > >
> >
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>


Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
There should be a few more lines at the end.
We only want the part from the last STARTUP_MSG to the end.

On one of mine, a successful start looks like this:
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = at/192.168.1.119
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.19.1-dev
STARTUP_MSG:   build =  -r ; compiled by 'jason' on Tue Mar 17 04:03:57 PDT
2009
/
2009-03-17 03:08:11,884 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Registered
FSDatasetStatusMBean
2009-03-17 03:08:11,886 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010
2009-03-17 03:08:11,889 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is
1048576 bytes/s
2009-03-17 03:08:12,142 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2009-03-17 03:08:12,155 INFO org.mortbay.util.Credential: Checking Resource
aliases
2009-03-17 03:08:12,518 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.webapplicationhand...@1e184cb
2009-03-17 03:08:12,578 INFO org.mortbay.util.Container: Started
WebApplicationContext[/static,/static]
2009-03-17 03:08:12,721 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.webapplicationhand...@1d9e282
2009-03-17 03:08:12,722 INFO org.mortbay.util.Container: Started
WebApplicationContext[/logs,/logs]
2009-03-17 03:08:12,878 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.webapplicationhand...@14a75bb
2009-03-17 03:08:12,884 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2009-03-17 03:08:12,951 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50075
2009-03-17 03:08:12,951 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.ser...@1358f03
2009-03-17 03:08:12,957 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=DataNode, sessionId=null
2009-03-17 03:08:13,242 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=DataNode, port=50020
2009-03-17 03:08:13,264 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2009-03-17 03:08:13,304 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 50020: starting
2009-03-17 03:08:13,343 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 50020: starting
2009-03-17 03:08:13,343 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =
DatanodeRegistration(192.168.1.119:50010,
storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075,
ipcPort=50020)
2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 50020: starting
2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020: starting
2009-03-17 03:08:13,351 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
192.168.1.119:50010,
storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075,
ipcPort=50020)In DataNode.run, data =
FSDataset{dirpath='/tmp/hadoop-0.19.0-jason/dfs/data/current'}
2009-03-17 03:08:13,352 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL
of 360msec Initial delay: 0msec
2009-03-17 03:08:13,391 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 14 blocks
got processed in 27 msecs
2009-03-17 03:08:13,392 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block
scanner.



On Thu, May 14, 2009 at 9:51 PM, Pankil Doshi  wrote:

> This is log from datanode.
>
>
> 2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 8 msecs
> 2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 9 msecs
> 2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 9 msecs
> 2009-05-14 05:36:14,592 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 9 msecs
> 2009-05-14 06:36:15,806 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 07:36:14,008 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 08:36:15,204 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 9 msecs
> 2009-05-14 09:36:13,430 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 10:36:14,642 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 12 msecs
> 2009-05-14 11:36:15,850 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 82 blocks got processed in 9 msecs
> 2009-05-14 12:36:14,193 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
> 8

Re: Datanodes fail to start

2009-05-14 Thread Pankil Doshi
This is log from datanode.


2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 8 msecs
2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 05:36:14,592 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 06:36:15,806 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 07:36:14,008 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 08:36:15,204 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 09:36:13,430 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 10:36:14,642 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 11:36:15,850 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 12:36:14,193 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 11 msecs
2009-05-14 13:36:15,454 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 14:36:13,662 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 15:36:14,930 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 13 msecs
2009-05-14 16:36:16,151 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 17:36:14,407 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 9 msecs
2009-05-14 18:36:15,659 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 10 msecs
2009-05-14 19:27:02,188 WARN org.apache.hadoop.dfs.DataNode:
java.io.IOException: Call to
hadoopmaster.utdallas.edu/10.110.95.61:9000failed on local except$
at org.apache.hadoop.ipc.Client.wrapException(Client.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:690)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2967)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:500)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:442)

2009-05-14 19:27:06,198 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: hadoopmaster.utdallas.edu/10.110.95.61:9000. Already tried 0
time(s).
2009-05-14 19:27:06,436 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at Slave1/127.0.1.1
/
2009-05-14 19:27:21,737 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = Slave1/127.0.1.1


On Thu, May 14, 2009 at 11:43 PM, jason hadoop wrote:

> The data node logs are on the datanode machines in the log directory.
> You may wish to buy my book and read chapter 4 on hdfs management.
>
> On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi  wrote:
>
> > Can u guide me where can I find datanode log files? As I cannot find it
> in
> > $hadoop/logs and so.
> >
> > I can only find  following files in logs folder :-
> >
> > hadoop-hadoop-namenode-hadoopmaster.log
> >hadoop-hadoop-namenode-hadoopmaster.out
> > hadoop-hadoop-namenode-hadoopmaster.out.1
> >   hadoop-hadoop-secondarynamenode-hadoopmaster.log
> > hadoop-hadoop-secondarynamenode-hadoopmaster.out
> > hadoop-hadoop-secondarynamenode-hadoopmaster.out.1
> >history
> >
> >
> > Thanks
> > Pankil
> >
> > On Thu, May 14, 2009 at 11:27 PM, jason hadoop  > >wrote:
> >
> > > You have to examine the datanode log files
> > > the namenode does not start the datanodes, the start script does.
> > > The name node passively waits for the datanodes to connect to it.
> > >
> > > On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi 
> > wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > Actually I had a cluster which was up.
> > > >
> > > > But i stopped the cluster as i  wanted to format it.But cant start it
> > > back.
> > > >
> > > > 1)when i give "start-dfs.sh" I get follow

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
The data node logs are on the datanode machines in the log directory.
You may wish to buy my book and read chapter 4 on hdfs management.

On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi  wrote:

> Can u guide me where can I find datanode log files? As I cannot find it in
> $hadoop/logs and so.
>
> I can only find  following files in logs folder :-
>
> hadoop-hadoop-namenode-hadoopmaster.log
>hadoop-hadoop-namenode-hadoopmaster.out
> hadoop-hadoop-namenode-hadoopmaster.out.1
>   hadoop-hadoop-secondarynamenode-hadoopmaster.log
> hadoop-hadoop-secondarynamenode-hadoopmaster.out
> hadoop-hadoop-secondarynamenode-hadoopmaster.out.1
>history
>
>
> Thanks
> Pankil
>
> On Thu, May 14, 2009 at 11:27 PM, jason hadoop  >wrote:
>
> > You have to examine the datanode log files
> > the namenode does not start the datanodes, the start script does.
> > The name node passively waits for the datanodes to connect to it.
> >
> > On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi 
> wrote:
> >
> > > Hello Everyone,
> > >
> > > Actually I had a cluster which was up.
> > >
> > > But i stopped the cluster as i  wanted to format it.But cant start it
> > back.
> > >
> > > 1)when i give "start-dfs.sh" I get following on screen
> > >
> > > starting namenode, logging to
> > >
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
> > > slave1.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
> > > slave3.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
> > > slave4.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
> > > slave2.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
> > > slave5.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
> > > slave6.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
> > > slave9.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
> > > slave8.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
> > > slave7.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
> > > slave10.local: starting datanode, logging to
> > > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
> > > hadoopmaster.local: starting secondarynamenode, logging to
> > >
> > >
> >
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out
> > >
> > >
> > > 2) from log file named "hadoop-hadoop-namenode-hadoopmaster.log" I get
> > > following
> > >
> > >
> > >
> > > 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode:
> STARTUP_MSG:
> > > /
> > > STARTUP_MSG: Starting NameNode
> > > STARTUP_MSG:   host = hadoopmaster/127.0.0.1
> > > STARTUP_MSG:   args = []
> > > STARTUP_MSG:   version = 0.18.3
> > > STARTUP_MSG:   build =
> > > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> > > 736250;
> > > compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> > > /
> > > 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> > > Initializing RPC Metrics with hostName=NameNode, port=9000
> > > 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode
> up
> > > at:
> > > hadoopmaster.local/192.168.0.1:9000
> > > 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > > Initializing JVM Metrics with processName=NameNode, sessionId=null
> > > 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
> > > Initializing NameNodeMeterics using context
> > > object:org.apache.hadoop.metrics.spi.NullContext
> > > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> > >
> > >
> >
> fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
> > > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> > > supergroup=supergroup
> > > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> > > isPermissionEnabled=true
> > > 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
> > > Initializing FSNamesystemMeterics using context
> > > object:org.apache.hadoop.metrics.spi.NullContext
> > > 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem:
> > Registered
> > > FSNamesystemStatusMBean
> > > 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of
> > files
> > > = 1
> > > 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of
> > files
> > > under construction = 0
> > > 2009-05-14 20:28:23,971 INFO org.apach

Re: Datanodes fail to start

2009-05-14 Thread Pankil Doshi
Can you guide me to where I can find the datanode log files? I cannot find
them in $hadoop/logs.

I can only find the following files in the logs folder:

hadoop-hadoop-namenode-hadoopmaster.log
hadoop-hadoop-namenode-hadoopmaster.out
hadoop-hadoop-namenode-hadoopmaster.out.1
hadoop-hadoop-secondarynamenode-hadoopmaster.log
hadoop-hadoop-secondarynamenode-hadoopmaster.out
hadoop-hadoop-secondarynamenode-hadoopmaster.out.1
history


Thanks
Pankil

On Thu, May 14, 2009 at 11:27 PM, jason hadoop wrote:

> You have to examine the datanode log files
> the namenode does not start the datanodes, the start script does.
> The name node passively waits for the datanodes to connect to it.
>
> On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi  wrote:
>
> > Hello Everyone,
> >
> > Actually I had a cluster which was up.
> >
> > But i stopped the cluster as i  wanted to format it.But cant start it
> back.
> >
> > 1)when i give "start-dfs.sh" I get following on screen
> >
> > starting namenode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
> > slave1.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
> > slave3.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
> > slave4.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
> > slave2.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
> > slave5.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
> > slave6.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
> > slave9.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
> > slave8.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
> > slave7.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
> > slave10.local: starting datanode, logging to
> > /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
> > hadoopmaster.local: starting secondarynamenode, logging to
> >
> >
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out
> >
> >
> > 2) from log file named "hadoop-hadoop-namenode-hadoopmaster.log" I get
> > following
> >
> >
> >
> > 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
> > /
> > STARTUP_MSG: Starting NameNode
> > STARTUP_MSG:   host = hadoopmaster/127.0.0.1
> > STARTUP_MSG:   args = []
> > STARTUP_MSG:   version = 0.18.3
> > STARTUP_MSG:   build =
> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> > 736250;
> > compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> > /
> > 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> > Initializing RPC Metrics with hostName=NameNode, port=9000
> > 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up
> > at:
> > hadoopmaster.local/192.168.0.1:9000
> > 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > Initializing JVM Metrics with processName=NameNode, sessionId=null
> > 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
> > Initializing NameNodeMeterics using context
> > object:org.apache.hadoop.metrics.spi.NullContext
> > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> >
> >
> fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
> > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> > supergroup=supergroup
> > 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> > isPermissionEnabled=true
> > 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
> > Initializing FSNamesystemMeterics using context
> > object:org.apache.hadoop.metrics.spi.NullContext
> > 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem:
> Registered
> > FSNamesystemStatusMBean
> > 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of
> files
> > = 1
> > 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of
> files
> > under construction = 0
> > 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of
> > size 80 loaded in 0 seconds.
> > 2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file
> > edits
> > of size 4 edits # 0 loaded in 0 seconds.
> > 2009-05-14 20:28:23,974 INFO org.apache.hadoop.fs.FSNamesystem: Finished
> > loading FSImage in 155 msecs
> > 2009-05-14 20:28:23,976 INFO org.apache.hadoop.fs.FSNamesystem: Total
> > number
> > of blocks = 0
> > 2

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
You have to examine the datanode log files.
The namenode does not start the datanodes; the start script does.
The namenode passively waits for the datanodes to connect to it.

On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi  wrote:

> Hello Everyone,
>
> Actually I had a cluster which was up.
>
> But i stopped the cluster as i  wanted to format it.But cant start it back.
>
> 1)when i give "start-dfs.sh" I get following on screen
>
> starting namenode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
> slave1.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
> slave3.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
> slave4.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
> slave2.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
> slave5.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
> slave6.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
> slave9.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
> slave8.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
> slave7.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
> slave10.local: starting datanode, logging to
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
> hadoopmaster.local: starting secondarynamenode, logging to
>
> /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out
>
>
> 2) from log file named "hadoop-hadoop-namenode-hadoopmaster.log" I get
> following
>
>
>
> 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = hadoopmaster/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250;
> compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> /
> 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=NameNode, port=9000
> 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up
> at:
> hadoopmaster.local/192.168.0.1:9000
> 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
> Initializing NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
>
> fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
> 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> supergroup=supergroup
> 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
> isPermissionEnabled=true
> 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
> Initializing FSNamesystemMeterics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of files
> = 1
> 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of files
> under construction = 0
> 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of
> size 80 loaded in 0 seconds.
> 2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file
> edits
> of size 4 edits # 0 loaded in 0 seconds.
> 2009-05-14 20:28:23,974 INFO org.apache.hadoop.fs.FSNamesystem: Finished
> loading FSImage in 155 msecs
> 2009-05-14 20:28:23,976 INFO org.apache.hadoop.fs.FSNamesystem: Total
> number
> of blocks = 0
> 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
> invalid blocks = 0
> 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
> under-replicated blocks = 0
> 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
> over-replicated blocks = 0
> 2009-05-14 20:28:23,988 INFO org.apache.hadoop.dfs.StateChange: STATE*
> Leaving safe mode after 0 secs.
> *2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
> Network topology has 0 racks and 0 datanodes*
> 2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
> UnderReplicatedBlocks has 0 blocks
> 2009-05-14 20:28:29,128 INFO org.mortbay.util.Credential: Checking Resourc

Re: hadoop streaming binary input / image processing

2009-05-14 Thread jason hadoop
A downside of this approach is that you will likely not have data locality
for data on shared file systems, compared with data coming from an input
split.
That being said, from your script, *hadoop dfs -get FILE -* will write the
file to standard out.
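
For reference, a rough Java equivalent of that (streaming a DFS file to
standard output), in case you want to avoid shelling out; error handling is
omitted and the class name is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);               // the configured DFS
    FSDataInputStream in = fs.open(new Path(args[0]));  // file to dump
    try {
      IOUtils.copyBytes(in, System.out, conf, false);   // stream to stdout
    } finally {
      in.close();
    }
  }
}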

On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk wrote:

> Just in addition to my previous post...
>
> You don't have to store the encoded files in a file system, of course, since
> you can write your own InputFormat which will do this on the fly; the
> overhead should not be that big.
>
> Piotr
>
> 2009/5/14 Piotr Praczyk 
>
> > Hi
> >
> > If you want to read the files from HDFS and cannot pass the binary data,
> > you can encode it (Base64, for example, though you can think about
> > something more efficient, since the range of characters acceptable in the
> > input string is wider than the Base64 alphabet). It should solve the
> > problem until some kind of binary input is supported (is that going to
> > happen?).
> >
> > Piotr
> >
> > 2009/5/14 openresearch 
> >
> >
> >> All,
> >>
> >> I have read some recommendations regarding image (binary input)
> >> processing using Hadoop streaming, which only accepts text out of the
> >> box for now.
> >> http://hadoop.apache.org/core/docs/current/streaming.html
> >> https://issues.apache.org/jira/browse/HADOOP-1722
> >> http://markmail.org/message/24woaqie2a6mrboc
> >>
> >> However, I have not gotten a straight answer.
> >>
> >> One recommendation is to put image data on HDFS, but then we have to do
> >> "hadoop dfs -get" for each file/dir and process it locally, which is
> >> very expensive.
> >>
> >> Another recommendation is to "...put them in a centralized place where
> >> all the hadoop nodes can access them (via e.g. an NFS mount)..."
> >> Obviously, IO will become the bottleneck and it defeats the purpose of
> >> distributed processing.
> >>
> >> I also notice an enhancement ticket is open for hadoop-core. Is it
> >> committed to any svn (0.21) branch? Can somebody show me an example of
> >> how to take *.jpg files (from HDFS) and process them in a distributed
> >> fashion using streaming?
> >>
> >> Many thanks
> >>
> >> -Qiming
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can have separate configuration files for the different datanodes.

If you are willing to deal with the complexity, you can manually start them
with altered properties from the command line.

rsync or other means of sharing identical configs is simple and common.

Raghu, your technique will only work well if you can complete steps 1-4 in
less than the datanode timeout interval, which may be the case for Alexandra.
I believe the timeout is 10 minutes.
If you pass the timeout interval, the namenode will start re-replicating the
blocks, and when the datanode comes back the namenode will delete the copies
that have become over-replicated.

On Thu, May 14, 2009 at 11:35 AM, Raghu Angadi wrote:

>
> Along these lines, an even simpler approach, I would think, is:
>
> 1) set data.dir to local and create the data.
> 2) stop the datanode
> 3) rsync local_dir network_dir
> 4) start the datanode with data.dir pointing at network_dir
>
> There is no need to format or rebalance.
>
> This way you can switch between local and network multiple times (without
> needing to rsync the data, if no changes are made during the tests).
>
> Raghu.
>
>
> Alexandra Alecu wrote:
>
>> Another possibility I am thinking about now, which is suitable for me as I
>> do
>> not actually have much data stored in the cluster when I want to perform
>> this switch is to set the replication level really high and then simply
>> remove the local storage locations and restart the cluster. With a bit of
>> luck the high level of replication will allow a full recovery of the
>> cluster
>> on restart.
>>
>> Is this something that you would advice?
>>
>> Many thanks,
>> Alexandra.
>>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
In the map-side join, the input file name is not visible, as the input is
actually a composite of a large number of files.

I have started answering on www.prohadoopbook.com
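
For anyone following along, a rough sketch of how the composite input is
configured with the 0.19 API; the paths and the choice of
KeyValueTextInputFormat are placeholders, not a complete job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  public static void configure(JobConf job) {
    // Both inputs must already be sorted and identically partitioned.
    Path left = new Path("/data/left");    // hypothetical input directories
    Path right = new Path("/data/right");
    job.setInputFormat(CompositeInputFormat.class);
    // "inner" joins records sharing a key across all sources; the mapper then
    // sees the key plus a TupleWritable of joined values, with no single
    // source file name attached to the record.
    job.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            left, right));
  }
}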

On Thu, May 14, 2009 at 1:19 PM, Stuart White wrote:

> On Thu, May 14, 2009 at 10:25 AM, jason hadoop 
> wrote:
> > If you put up a discussion question on www.prohadoopbook.com, I will
> fill in
> > the example on how to do this.
>
> Done.  Thanks!
>
> http://www.prohadoopbook.com/forum/topics/preserving-partition-file
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Datanodes fail to start

2009-05-14 Thread Pankil Doshi
Hello Everyone,

Actually I had a cluster which was up.

But I stopped the cluster because I wanted to format it, and now I can't
start it back up.

1) When I give "start-dfs.sh" I get the following on screen:

starting namenode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
slave1.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
slave3.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
slave4.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
slave2.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
slave5.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
slave6.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
slave9.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
slave8.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
slave7.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
slave10.local: starting datanode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
hadoopmaster.local: starting secondarynamenode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out


2) From the log file named "hadoop-hadoop-namenode-hadoopmaster.log" I get the
following:



2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoopmaster/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250;
compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
/
2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up at:
hadoopmaster.local/192.168.0.1:9000
2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
Initializing NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
supergroup=supergroup
2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
isPermissionEnabled=true
2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
Initializing FSNamesystemMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem: Registered
FSNamesystemStatusMBean
2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of files
= 1
2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of files
under construction = 0
2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of
size 80 loaded in 0 seconds.
2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file edits
of size 4 edits # 0 loaded in 0 seconds.
2009-05-14 20:28:23,974 INFO org.apache.hadoop.fs.FSNamesystem: Finished
loading FSImage in 155 msecs
2009-05-14 20:28:23,976 INFO org.apache.hadoop.fs.FSNamesystem: Total number
of blocks = 0
2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
invalid blocks = 0
2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
under-replicated blocks = 0
2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
over-replicated blocks = 0
2009-05-14 20:28:23,988 INFO org.apache.hadoop.dfs.StateChange: STATE*
Leaving safe mode after 0 secs.
*2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
Network topology has 0 racks and 0 datanodes*
2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
UnderReplicatedBlocks has 0 blocks
2009-05-14 20:28:29,128 INFO org.mortbay.util.Credential: Checking Resource
aliases
2009-05-14 20:28:29,243 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2009-05-14 20:28:29,244 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2009-05-14 20:28:29,245 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2009-05-14 20:28:29,750 INFO org.mortbay.util.Container: Started
org.mortbay.jetty.servlet.webapplicationhand...@7fcebc9f
2009-05-14 20:28:29,838 INFO

Re: Large number of map output keys and performance issues.

2009-05-14 Thread Chuck Lam
Just thinking out loud here to see if anything strikes a chord.

Since you're talking about an access log, I imagine the data is pretty
skewed, i.e., a good percentage of the accesses are for one resource. If you
use resource id as the key, that means a good percentage of the intermediate
data is shuffled to just one reducer.

Now the usual solution is to use a combiner, and it seems like you've done
that already. However, given that you're only using 3 task trackers, there's
still some freak chance that one reducer ends up doing most of the work.

Anyway, a quick way to check this is to set your reducer to the
IdentityReducer and run the job again. This way you see the input to each
reducer and can see whether they're balanced or not.
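
A minimal sketch of that diagnostic tweak, assuming the old JobConf API and the
LongWritable keys/values from your job; everything else stays the same:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SkewCheck {
  /** Swap the real reducer for IdentityReducer so each reducer writes out
   *  exactly its (post-combiner) input, making per-partition skew visible. */
  public static void configure(JobConf job) {
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(LongWritable.class);
  }
}

Comparing the sizes of the part-* outputs then tells you whether one reducer is
getting most of the data.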




On Thu, May 14, 2009 at 1:04 PM, Tiago Macambira wrote:

> On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon  wrote:
> > Hi Tiago,
>
> Hi there.
>
> First of all, sorry for the late reply --- I was investigating the
> issue further before replying.
>
> Just to make the whole thing clear(er), let me add some numbers and
> explain my problem.
>
> I have a ~80GB sequence file holding entries for 3 million users,
> regarding how many times those said users accessed aprox. 160 million
> distinct resources. There were aprox. 5000 million accesses for those
> resources. Each entry in the seq. file is encoded using google
> protocol buffers and compressed with gzip (don't ask...). I have to
> extract some metrics from this data. To start, I've chosen to rank
> resources based on the number of accesses.
>
> I thought that this would be a pretty simple MR application to write
> and run. A beefed up WordCount, if I may:
>
> Map:
> for each user:
>for each resource_id accessed by current user:
># resource.id is a LongWritable
>collect(resource.id, 1)
>
> Combine and Reduce work just as in WordCount, except that keys and
> values are both LongWritables. The final step to calculate the ranking
> --- sorting the resources based on their accumulated access count ---
> is done using the unix sort command. Nothing really fancy here.
>
> This mapreduce consumed aprox. 2 hours to run --- I said 4 hours in
> the previous e-mail, sorry :-) IMBW, but it seems quite a long time to
> compute a ranking. I coded a similar application in a filter-stream
> framework and it took less than half an hour to run -- even with most
> of its data being read from the network.
>
> So, what I was wondering is: what am I doing wrong?
>  -  Is it just a matter of fine-tuning my hadoop cluster setup?
>  -  This is a valid MR application, right?
>  -  Is it just that I have too few "IO units"? I'm using 4 DataNodes
> and 3 TaskTrackers (dual octacores).
>
> Now, back for our regular programming...
>
>
> > Here are a couple thoughts:
> >
> > 1) How much data are you outputting? Obviously there is a certain amount
> of
> > IO involved in actually outputting data versus not ;-)
>
> Well, the map phase is outputting 180GB of data for aprox. 1000
> million intermediate keys. I know it is going to take some time to
> save this amount of data to disk but yet... this much time?
>
>
> > 2) Are you using a reduce phase in this job? If so, since you're cutting
> off
> > the data at map output time, you're also avoiding a whole sort
> computation
> > which involves significant network IO, etc.
>
> Yes, a Reduce and a Combine phases. From the original 5000 million,
> the combine outputs 1000 million and the reduce ends outputting (the
> expected) 160 million keys.
>
>
> > 3) What version of Hadoop are you running?
>
> 0.18.3
>
> Cheers.
>
> Tiago Alves Macambira
> --
> I may be drunk, but in the morning I will be sober, while you will
> still be stupid and ugly." -Winston Churchill
>


Re: Fast upload of input data to S3?

2009-05-14 Thread Jeff Hammerbacher
http://www.freedomoss.com/clouddataingestion?

On Thu, May 14, 2009 at 1:23 PM, Peter Skomoroch
wrote:

> Does anyone have upload performance numbers to share or suggested utilities
> for uploading Hadoop input data to S3 for an EC2 cluster?
>
> I'm finding EBS volume transfer to HDFS via put to be extremely slow...
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>


Task process exit with nonzero status of 1

2009-05-14 Thread g00dn3ss
Hey All,
I am running Hadoop 0.19.1.  One of my Mapper tasks was failing and the
problem that was reported was:

  Task process exit with nonzero status of 1...

Looking through the mailing list archives, I got the impression that this
was only caused by a JVM crash.

After much hair pulling, I figured out that the problem was actually caused
by an exception thrown in my Mapper's close() method. I was catching the
initial exception, printing a message and a stack trace, and rethrowing the
exception as a runtime exception, but my printouts never showed up. Also,
when the task was restarted after the initial failure, it would always fail
again, in seemingly random places, sometimes printing nothing at all in the
logs.

I'm not sure if throwing runtime exceptions in the close() method is a
discouraged practice for Hadoop Mappers. In any case, I thought I'd report
my experience in case it helps anyone else.
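
For illustration, a stripped-down version of the pattern I described (old 0.19
API; the flush logic and class names are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FlushingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    out.collect(value, new LongWritable(1));   // placeholder map logic
  }

  @Override
  public void close() throws IOException {
    try {
      flushBuffers();                          // hypothetical cleanup work
    } catch (IOException e) {
      // Rethrowing here fails the task, and in my case the message and stack
      // trace below never showed up in the task logs.
      e.printStackTrace();
      throw new RuntimeException("close() failed", e);
    }
  }

  private void flushBuffers() throws IOException {
    // hypothetical: write out whatever state was buffered during map()
  }
}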

g00dn3ss


Fast upload of input data to S3?

2009-05-14 Thread Peter Skomoroch
Does anyone have upload performance numbers to share or suggested utilities
for uploading Hadoop input data to S3 for an EC2 cluster?

I'm finding EBS volume transfer to HDFS via put to be extremely slow...

-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
On Thu, May 14, 2009 at 10:25 AM, jason hadoop  wrote:
> If you put up a discussion question on www.prohadoopbook.com, I will fill in
> the example on how to do this.

Done.  Thanks!

http://www.prohadoopbook.com/forum/topics/preserving-partition-file


Hadoop JMX and Cacti

2009-05-14 Thread Edward Capriolo
Hey all,
I have come pretty far along with using Cacti to graph Hadoop JMX
variables: http://www.jointhegrid.com/hadoop/. Currently I have about 8
different Hadoop graph types available for the NameNode and the DataNode.

The NameNode has many fairly complete and detailed counters. I have been
browsing the Hadoop Core JIRA, e.g. "Metrics on Secondary namenode's
activity" (HADOOP-3990). I also notice that the JobTracker in 0.19.1 has no
JMX attributes. It also seems like some of the JMX attributes that exist
currently are non-functional.

I did not want to just create a JIRA for each JMX attribute that I found not
to be working; I want some confirmation that I am not doing something wrong
first. I figured I would wave a flag here and see who is currently involved
or interested. I have some experience with Hadoop internals and would like
to get involved in implementing some of the JMX objects. Where should I
start?
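
In case it helps anyone poking at the same attributes, here is a rough JMX
client sketch that just dumps whatever MBeans a daemon exposes. The service URL
format is the standard RMI one; the port and the "hadoop" domain filter are
assumptions for my setup, and the daemon needs the usual
com.sun.management.jmxremote options enabled:

import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpHadoopMBeans {
  public static void main(String[] args) throws Exception {
    // e.g. args[0] = "namenode-host", args[1] = "8004" (whatever port you
    // exposed via com.sun.management.jmxremote.port)
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
    try {
      MBeanServerConnection conn = jmxc.getMBeanServerConnection();
      // List every MBean registered under the "hadoop" domain.
      Set<ObjectName> names = conn.queryNames(new ObjectName("hadoop:*"), null);
      for (ObjectName name : names) {
        System.out.println(name);
        for (MBeanAttributeInfo attr : conn.getMBeanInfo(name).getAttributes()) {
          try {
            System.out.println("  " + attr.getName() + " = "
                + conn.getAttribute(name, attr.getName()));
          } catch (Exception e) {
            // attributes that are registered but not working show up here
            System.out.println("  " + attr.getName() + " -> " + e);
          }
        }
      }
    } finally {
      jmxc.close();
    }
  }
}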


Re: Large number of map output keys and performance issues.

2009-05-14 Thread Tiago Macambira
On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon  wrote:
> Hi Tiago,

Hi there.

First of all, sorry for the late reply --- I was investigating the
issue further before replying.

Just to make the whole thing clear(er), let me add some numbers and
explain my problem.

I have a ~80GB sequence file holding entries for 3 million users,
regarding how many times those said users accessed aprox. 160 million
distinct resources. There were aprox. 5000 million accesses for those
resources. Each entry in the seq. file is encoded using google
protocol buffers and compressed with gzip (don't ask...). I have to
extract some metrics from this data. To start, I've chosen to rank
resources based on the number of accesses.

I thought that this would be a pretty simple MR application to write
and run. A beefed up WordCount, if I may:

Map:
for each user:
    for each resource_id accessed by current user:
        # resource.id is a LongWritable
        collect(resource.id, 1)

Combine and Reduce work just as in WordCount, except that keys and
values are both LongWritables. The final step to calculate the ranking
--- sorting the resources based on their accumulated access count ---
is done using the unix sort command. Nothing really fancy here.
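
In Java terms (old 0.18 API), the sketch is roughly the following; UserRecord
is just a stand-in for the real protocol-buffer-backed record and its accessor:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ResourceRanking {

  /** Stand-in for the real (protocol-buffer-backed) per-user record. */
  public interface UserRecord {
    long[] getAccessedResourceIds();
  }

  public static class AccessMapper extends MapReduceBase
      implements Mapper<LongWritable, UserRecord, LongWritable, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final LongWritable resourceId = new LongWritable();

    public void map(LongWritable userId, UserRecord user,
        OutputCollector<LongWritable, LongWritable> out, Reporter reporter)
        throws IOException {
      for (long id : user.getAccessedResourceIds()) {
        resourceId.set(id);
        out.collect(resourceId, ONE);          // (resource id, 1)
      }
    }
  }

  /** Used as both combiner and reducer, exactly as in WordCount. */
  public static class SumReducer extends MapReduceBase
      implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    public void reduce(LongWritable resourceId, Iterator<LongWritable> counts,
        OutputCollector<LongWritable, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(resourceId, new LongWritable(sum));
    }
  }
}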

This mapreduce consumed aprox. 2 hours to run --- I said 4 hours in
the previous e-mail, sorry :-) IMBW, but it seems quite a long time to
compute a ranking. I coded a similar application in a filter-stream
framework and it took less than half an hour to run -- even with most
of its data being read from the network.

So, what I was wondering is: what am I doing wrong?
  -  Is it just a matter of fine-tuning my hadoop cluster setup?
  -  This is a valid MR application, right?
  -  Is it just that I have too few "IO units"? I'm using 4 DataNodes
and 3 TaskTrackers (dual octacores).

Now, back for our regular programming...


> Here are a couple thoughts:
>
> 1) How much data are you outputting? Obviously there is a certain amount of
> IO involved in actually outputting data versus not ;-)

Well, the map phase is outputting 180GB of data for aprox. 1000
million intermediate keys. I know it is going to take some time to
save this amount of data to disk but yet... this much time?


> 2) Are you using a reduce phase in this job? If so, since you're cutting off
> the data at map output time, you're also avoiding a whole sort computation
> which involves significant network IO, etc.

Yes, a Reduce and a Combine phases. From the original 5000 million,
the combine outputs 1000 million and the reduce ends outputting (the
expected) 160 million keys.


> 3) What version of Hadoop are you running?

0.18.3

Cheers.

Tiago Alves Macambira
--
I may be drunk, but in the morning I will be sober, while you will
still be stupid and ugly." -Winston Churchill


Re: public IP for datanode on EC2

2009-05-14 Thread Raghu Angadi

Philip Zeyliger wrote:


You could use ssh to set up a SOCKS proxy between your machine and
ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.


Very useful write-up. Regarding the problem with reverse DNS mentioned
(that's why you had to add a DNS record for the internal IP): it is fixed in
https://issues.apache.org/jira/browse/HADOOP-5191 (for HDFS access at least).
Some mapred parts are still affected (HADOOP-5610). Depending on reverse DNS
should be avoided.


Ideally, setting fs.default.name to the internal IP should just work for
clients, both internally and externally (through proxies).
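
A minimal client-side sketch of that SOCKS setup; the proxy port, the namenode
address and doing it programmatically (rather than in hadoop-site.xml) are just
for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SocksClientCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route client RPCs through the SOCKS proxy opened with
    // "ssh -D 1080 <gateway>" (the port is an example).
    conf.set("hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory");
    conf.set("hadoop.socks.server", "localhost:1080");
    // fs.default.name pointing at the cluster's namenode (hypothetical name).
    conf.set("fs.default.name", "hdfs://namenode.internal:9000");
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/")));   // simple connectivity check
  }
}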


Raghu.


RE: public IP for datanode on EC2

2009-05-14 Thread Joydeep Sen Sarma
Btw - I figured out the problem.

The jobconf from the remote client had the SOCKS proxy configuration - the JVMs
spawned by the TaskTrackers picked this up and tried to connect through the
proxy, which of course didn't work.

This was easy to solve - I just had to make the remote initialization script
mark hadoop.rpc.socket.factory.class.default as a final variable in the
hadoop-site.xml on the server side.

I am assuming that this would be a good thing to do in general (there is no
reason server-side traffic should be routed through a proxy!).

Filed https://issues.apache.org/jira/browse/HADOOP-5839 to follow up on the
issues uncovered here.

-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Thursday, May 14, 2009 7:07 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma  wrote:
> The ec2 documentation point to the use of public 'ip' addresses - whereas 
> using public 'hostnames' seems safe since it resolves to internal addresses 
> from within the cluster (and resolve to public ip addresses from outside).
>
> The only data transfer that I would incur while submitting jobs from outside 
> is the cost of copying the jar files and any other files meant for the 
> distributed cache). That would be extremely small.
>
>
> -Original Message-
> From: Tom White [mailto:t...@cloudera.com]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> Hi Joydeep,
>
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
>
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
>
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
>
> Cheers,
> Tom
>
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma  
> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public 
>> hostname (instead of the private hostname).
>>
>> Now I can submit jobs remotely via the socks proxy (the problem below is 
>> resolved) - but the map tasks fail with an exception:
>>
>>
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
>> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
>> Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
>> running child
>> java.io.IOException: Call to 
>> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
>> local exception: Connection refused
>>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy1.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>>        at 
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:177)
>>        at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>>        at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>>
>>
>> strangely enough - job submissions from nodes within the ec2 cluster work 
>> just fine. I looked at the job.xml files of jobs submitted locally and 
>> remotely and don't see any relevant differences.
>>
>> Totally foxed now.
>>
>> Joydeep
>>
>> -Original Message-
>> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
>> Sent: Wednesday, May 13, 2009 9:38 PM
>> To: core-user@hadoop.apache.org
>> Cc: Tom White
>> Subject: RE: public IP for datanode on EC2
>>
>> Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
>> dfs command line operations work just fine.
>>
>> However - I am hitting a new error during job submission (running 
>> hadoop-0.19.0):
>>
>> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
>> (SessionState.java:printError(279)) - Job Submission failed with exception 
>> 'java.net.UnknownHostException(unknown host: 

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Raghu Angadi


Along these lines, an even simpler approach, I would think, is:

1) set data.dir to local and create the data.
2) stop the datanode
3) rsync local_dir network_dir
4) start the datanode with data.dir pointing at network_dir

There is no need to format or rebalance.

This way you can switch between local and network multiple times
(without needing to rsync the data, if no changes are made during the tests).


Raghu.

Alexandra Alecu wrote:

Another possibility I am thinking about now, which is suitable for me as I do
not actually have much data stored in the cluster when I want to perform
this switch, is to set the replication level really high and then simply
remove the local storage locations and restart the cluster. With a bit of
luck, the high level of replication will allow a full recovery of the cluster
on restart.

Is this something that you would advice?

Many thanks,
Alexandra.




Re: Shorten interval between datanode going down and being detected as dead by namenode?

2009-05-14 Thread Todd Lipcon
Hi Nesvarbu,

It sounds like your problem might be related to the following JIRA:

https://issues.apache.org/jira/browse/HADOOP-5713

Here's the relevant code from FSNamesystem.java:

long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
this.heartbeatRecheckInterval = conf.getInt(
    "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval +
    10 * heartbeatInterval;

It looks like you specified dfs.heartbeat.recheck.interval instead of
heartbeat.recheck.interval. This inconsistency is unfortunate :( With the
defaults, the expiry interval works out to 2 * 5 min + 10 * 3 s = 10.5
minutes, which matches the roughly 10 minutes you are seeing.

-Todd

On Fri, May 8, 2009 at 2:13 PM, nesvarbu No  wrote:

> Hi All,
>
> I've been testing hdfs with 3 datanodes cluster, and I've noticed that if I
> stopped 1 datanode I still can read all the files, but "hadoop dfs
> -copyFromLocal" command fails. In the namenode web interface I can see that
> it still thinks that datanode is alive and basically detects that it's dead
> in 10 minutes. After reading list archives I've tried modifying heartbeat
> intervals, by using these options:
>
> <property>
>   <name>dfs.heartbeat.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> <property>
>   <name>dfs.heartbeat.recheck.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> <property>
>   <name>dfs.namenode.decommission.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> It still detects in 10 minutes. Is there a way to shorten this interval? (I
> thought if I set data replication to 2, and have 3 nodes (basically have
> one
> spare) writes won't fail, but they still do fail.)
>


RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
Before 0.19, fsimage/edits were in the same directory.
So whenever the secondary finished checkpointing, it copied the fsimage back
while the namenode kept on writing to the edits file.

Usually we observed some latency on the namenode side during that time.

HADOOP-3948 would probably help, in 0.19 or later.
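
For reference, the interval being discussed below is fs.checkpoint.period
(3600 seconds by default), together with fs.checkpoint.size; here is a tiny
sketch of overriding it through the Configuration API, though normally you
would just set it in the secondary namenode's hadoop-site.xml:

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriodExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The secondary namenode checkpoints when this many seconds pass, or
    // earlier if the edits file grows past fs.checkpoint.size bytes.
    conf.setInt("fs.checkpoint.period", 3600);   // default: one hour
    System.out.println("fs.checkpoint.period = "
        + conf.getInt("fs.checkpoint.period", 3600) + " s");
  }
}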

Koji

-Original Message-
From: Brian Bockelman [mailto:bbock...@cse.unl.edu] 
Sent: Thursday, May 14, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

Hey Koji,

It's an expensive operation - for the secondary namenode, not the  
namenode itself, right?  I don't particularly care if I stress out a  
dedicated node that doesn't have to respond to queries ;)

Locally we checkpoint+backup fairly frequently (not 5 minutes ...  
maybe less than the default hour) due to sheer paranoia of losing  
metadata.

Brian

On May 14, 2009, at 12:25 PM, Koji Noguchi wrote:

>> The secondary namenode takes a snapshot
>> at 5 minute (configurable) intervals,
>>
> This is a bit too aggressive.
> Checkpointing is still an expensive operation.
> I'd say every hour or even every day.
>
> Isn't the default 3600 seconds?
>
> Koji
>
> -Original Message-
> From: jason hadoop [mailto:jason.had...@gmail.com]
> Sent: Thursday, May 14, 2009 7:46 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Setting up another machine as secondary node
>
> any machine put in the conf/masters file becomes a secondary namenode.
>
> At some point there was confusion on the safety of more than one
> machine,
> which I believe was settled, as many are safe.
>
> The secondary namenode takes a snapshot at 5 minute (configurable)
> intervals, rebuilds the fsimage and sends that back to the namenode.
> There is some performance advantage of having it on the local machine,
> and
> some safety advantage of having it on an alternate machine.
> Could someone who remembers speak up on the single vrs multiple
> secondary
> namenodes?
>
>
> On Thu, May 14, 2009 at 6:07 AM, David Ritch 
> wrote:
>
>> First of all, the secondary namenode is not a what you might think a
>> secondary is - it's not failover device.  It does make a copy of the
>> filesystem metadata periodically, and it integrates the edits into  
>> the
>> image.  It does *not* provide failover.
>>
>> Second, you specify its IP address in hadoop-site.xml.  This is where
> you
>> can override the defaults set in hadoop-default.xml.
>>
>> dbr
>>
>> On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani
> >> wrote:
>>
>>> Hi,
>>>I wanna set up a cluster of 5 nodes in such a way that
>>> node1 - master
>>> node2 - secondary namenode
>>> node3 - slave
>>> node4 - slave
>>> node5 - slave
>>>
>>>
>>> How do we go about that?
>>> there is no property in hadoop-env where i can set the ip-address
> for
>>> secondary name node.
>>>
>>> if i set node-1 and node-2 in masters, and when we start dfs, in
> both the
>>> m/cs, the namenode n secondary namenode processes r present. but i
> think
>>> only node1 is active.
>>> n my namenode fail over operation fails.
>>>
>>> ny suggesstions?
>>>
>>> Regards,
>>> Rakhi
>>>
>>
>
>
>
> -- 
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals



Re: Setting up another machine as secondary node

2009-05-14 Thread Brian Bockelman

Hey Koji,

It's an expensive operation - for the secondary namenode, not the  
namenode itself, right?  I don't particularly care if I stress out a  
dedicated node that doesn't have to respond to queries ;)


Locally we checkpoint+backup fairly frequently (not 5 minutes ...  
maybe less than the default hour) due to sheer paranoia of losing  
metadata.


Brian

On May 14, 2009, at 12:25 PM, Koji Noguchi wrote:


The secondary namenode takes a snapshot
at 5 minute (configurable) intervals,


This is a bit too aggressive.
Checkpointing is still an expensive operation.
I'd say every hour or even every day.

Isn't the default 3600 seconds?

Koji

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com]
Sent: Thursday, May 14, 2009 7:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

any machine put in the conf/masters file becomes a secondary namenode.

At some point there was confusion on the safety of more than one
machine,
which I believe was settled, as many are safe.

The secondary namenode takes a snapshot at 5 minute (configurable)
intervals, rebuilds the fsimage and sends that back to the namenode.
There is some performance advantage of having it on the local machine,
and
some safety advantage of having it on an alternate machine.
Could someone who remembers speak up on the single vrs multiple
secondary
namenodes?


On Thu, May 14, 2009 at 6:07 AM, David Ritch 
wrote:


First of all, the secondary namenode is not a what you might think a
secondary is - it's not failover device.  It does make a copy of the
filesystem metadata periodically, and it integrates the edits into  
the

image.  It does *not* provide failover.

Second, you specify its IP address in hadoop-site.xml.  This is where

you

can override the defaults set in hadoop-default.xml.

dbr

On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani


wrote:



Hi,
   I wanna set up a cluster of 5 nodes in such a way that
node1 - master
node2 - secondary namenode
node3 - slave
node4 - slave
node5 - slave


How do we go about that?
there is no property in hadoop-env where i can set the ip-address

for

secondary name node.

if i set node-1 and node-2 in masters, and when we start dfs, in

both the

m/cs, the namenode n secondary namenode processes r present. but i

think

only node1 is active.
n my namenode fail over operation fails.

ny suggesstions?

Regards,
Rakhi







--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals




RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
> The secondary namenode takes a snapshot 
> at 5 minute (configurable) intervals,
>
This is a bit too aggressive.
Checkpointing is still an expensive operation.
I'd say every hour or even every day.

Isn't the default 3600 seconds?

Koji

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: Thursday, May 14, 2009 7:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

any machine put in the conf/masters file becomes a secondary namenode.

At some point there was confusion on the safety of more than one
machine,
which I believe was settled, as many are safe.

The secondary namenode takes a snapshot at 5 minute (configurable)
intervals, rebuilds the fsimage and sends that back to the namenode.
There is some performance advantage of having it on the local machine,
and
some safety advantage of having it on an alternate machine.
Could someone who remembers speak up on the single vrs multiple
secondary
namenodes?


On Thu, May 14, 2009 at 6:07 AM, David Ritch 
wrote:

> First of all, the secondary namenode is not a what you might think a
> secondary is - it's not failover device.  It does make a copy of the
> filesystem metadata periodically, and it integrates the edits into the
> image.  It does *not* provide failover.
>
> Second, you specify its IP address in hadoop-site.xml.  This is where
you
> can override the defaults set in hadoop-default.xml.
>
> dbr
>
> On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani
 >wrote:
>
> > Hi,
> > I wanna set up a cluster of 5 nodes in such a way that
> > node1 - master
> > node2 - secondary namenode
> > node3 - slave
> > node4 - slave
> > node5 - slave
> >
> >
> > How do we go about that?
> > there is no property in hadoop-env where i can set the ip-address
for
> > secondary name node.
> >
> > if i set node-1 and node-2 in masters, and when we start dfs, in
both the
> > m/cs, the namenode n secondary namenode processes r present. but i
think
> > only node1 is active.
> > n my namenode fail over operation fails.
> >
> > ny suggesstions?
> >
> > Regards,
> > Rakhi
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


RE: Map-side join: Sort order preserved?

2009-05-14 Thread Jingkei Ly
You can also get the input file name with conf.get("map.input.file") and
reuse the last part of the filename (i.e. part-0) with the
OutputCommitter.
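A tiny sketch of that idea (the class name is made up; it assumes the inputs are
named part-00000, part-00001, ... by the earlier TotalOrderPartitioner job):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // Hypothetical base class: remember which input partition this map task is reading.
    public class PartNameMapperBase extends MapReduceBase {
      protected String partName;   // e.g. "part-00003"

      public void configure(JobConf job) {
        String inputFile = job.get("map.input.file");   // full path of this split's file
        partName = inputFile.substring(inputFile.lastIndexOf("part-"));
      }
    }

The mapper (or an OutputCommitter) can then reuse partName when naming or
renaming its output.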

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: 14 May 2009 16:25
To: core-user@hadoop.apache.org
Subject: Re: Map-side join: Sort order preserved?

Sort order is preserved if your Mapper doesn't change the key ordering
in
output. Partition name is not preserved.

What I have done is to manually work out what the partition number of
the
output file should be for each map task, by calling the partitioner on
an
input key, and then renaming the output in the close method.

Conceptually the place for this dance is in the OutputCommitter, but I
haven't used them in production code, and my mapside join examples come
from
before they were available.

the Hadoop join framework handles setting the split size to
Long.MAX_VALUE
for you.

If you put up a discussion question on www.prohadoopbook.com, I will
fill in
the example on how to do this.

On Thu, May 14, 2009 at 8:04 AM, Stuart White
wrote:

> I'm implementing a map-side join as described in chapter 8 of "Pro
> Hadoop".  I have two files that have been partitioned using the
> TotalOrderPartitioner on the same key into the same number of
> partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
> one Mapper will handle an entire partition.
>
> I want the output to be written in the same partitioned, total sort
> order.  If possible, I want to accomplish this by setting my
> NumReducers to 0 and having the output of my Mappers written directly
> to HDFS, thereby skipping the partition/sort step.
>
> My question is this: Am I guaranteed that the Mapper that processes
> part-0 will have its output written to the output file named
> part-0, the Mapper that processes part-1 will have its output
> written to part-1, etc... ?
>
> If so, then I can preserve the partitioning/sort order of my input
> files without re-partitioning and re-sorting.
>
> Thanks.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals



This message should be regarded as confidential. If you have received this 
email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy by 
an authorised signatory.  The contents of this email may relate to dealings 
with other companies within the Detica Group plc group of companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.




Re: hadoop streaming binary input / image processing

2009-05-14 Thread Piotr Praczyk
just in addition to my previous post...

You don't have to store the encoded files in a file system of course, since
you can write your own InputFormat which will do the encoding on the fly... the
overhead should not be that big.

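For what it's worth, here is a rough sketch of what such an InputFormat's record
reader could look like (the class name is invented and the details are simplified;
it assumes the Apache commons-codec Base64 class is on the classpath, as bundled
with Hadoop releases of this era, and that each file fits in memory). Each whole
binary file becomes a single (filename, base64-text) record, which streaming can
then hand to the mapper as ordinary text:

    import java.io.IOException;
    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    // Hypothetical sketch: one whole binary file -> one (filename, base64 text) record.
    public class Base64WholeFileRecordReader implements RecordReader<Text, Text> {

      private final FileSplit split;
      private final JobConf conf;
      private boolean done = false;

      public Base64WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
      }

      public boolean next(Text key, Text value) throws IOException {
        if (done) {
          return false;
        }
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        byte[] contents = new byte[(int) split.getLength()];  // whole file, so keep files small
        FSDataInputStream in = fs.open(file);
        try {
          in.readFully(0, contents);
        } finally {
          in.close();
        }
        key.set(file.toString());
        value.set(Base64.encodeBase64(contents));   // binary image -> ASCII text
        done = true;
        return true;
      }

      public Text createKey() { return new Text(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return done ? split.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() throws IOException { }
    }

The matching InputFormat would extend FileInputFormat, return false from
isSplitable() so a file is never split, and hand back this reader from
getRecordReader().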
Piotr

2009/5/14 Piotr Praczyk 

> Hi
>
> If you want to read the files form HDFS and can not pass the binary data,
> you can do some encoding of it (base 64 for example, but you can think about
> sth more efficient since the range of characters accprable in the input
> string is wider than that used by BASE64). It should solve the problem until
> some king of binary input is supported ( is it going to happen? ).
>
> Piotr
>
> 2009/5/14 openresearch 
>
>
>> All,
>>
>> I have read some recommendation regarding image (binary input) processing
>> using Hadoop-streaming which only accept text out-of-box for now.
>> http://hadoop.apache.org/core/docs/current/streaming.html
>> https://issues.apache.org/jira/browse/HADOOP-1722
>> http://markmail.org/message/24woaqie2a6mrboc
>>
>> However, I have not got any straight answer.
>>
>> One recommendation is to put image data on HDFS, but we have to do "hdf
>> -get" for each file/dir and process it locally which is every expensive.
>>
>> Another recommendation is to "...put them in a centralized place where all
>> the hadoop nodes can access them (via .e.g, NFS mount)..." Obviously, IO
>> will becomes bottleneck and it defeat the purpose of distributed
>> processing.
>>
>> I also notice some enhancement ticket is open for hadoop-core. Is it
>> committed to any svn (0.21) branch? can somebody show me an example how to
>> take *.jpg files (from HDFS), and process files in a distributed fashion
>> using streaming?
>>
>> Many thanks
>>
>> -Qiming
>> --
>> View this message in context:
>> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>


Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Sorry, I had missed that Todd had already created a Jira for this:
HADOOP-5761

Any progress there?

Thanks,
Lance

On Thu, May 14, 2009 at 8:52 AM, Lance Riedel  wrote:

> Here is the point in the logs where the infinite loop begins - see time
> stamp 2009-05-14 04:03:56,348 : (JobTracker)
>
> 2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1168_m_29_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_07_0' to tip
> task_200905122015_1183_r_07, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:58504'
> 2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1184_m_00_1' to tip
> task_200905122015_1184_m_00, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :57741'
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1182_r_11_0' has completed
> task_200905122015_1182_r_11 successfully.
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1182_r_10_1' to tip
> task_200905122015_1182_r_10, for tracker
> 'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :46248'
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_04_1
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1184_m_04_1' has completed
> task_200905122015_1184_m_04 successfully.
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator:
> measured blowup on task_200905122015_1184_m_04 was 20150008/21581175 =
> 0.93368447269437372009-05-14 04:03:56,331 INFO
> org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup =
> 0.9152383292400812
> 2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_08_0' to tip
> task_200905122015_1183_r_08, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1:57741'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_09_0' to tip
> task_200905122015_1183_r_09, for tracker
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1181_r_09_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_04_1
> 2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_10_0' to tip
> task_200905122015_1183_r_10, for tracker
> 'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/
> 127.0.0.1:46518'
> 2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_14_1
> 2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_11_0' to tip
> task_200905122015_1183_r_11, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 54311, call
> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@a4fe4, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
> at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
> at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown
> Source)at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_200905122015_1183_r_07_0 is 24 ms debug.
> 2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_12_0' to tip
> task_200905122015

Re: hadoop streaming binary input / image processing

2009-05-14 Thread Piotr Praczyk
Hi

If you want to read the files from HDFS and cannot pass the binary data,
you can do some encoding of it (Base64 for example, but you can think about
something more efficient, since the range of characters acceptable in the input
string is wider than that used by Base64). It should solve the problem until
some kind of binary input is supported (is it going to happen?).

Piotr

2009/5/14 openresearch 

>
> All,
>
> I have read some recommendation regarding image (binary input) processing
> using Hadoop-streaming which only accept text out-of-box for now.
> http://hadoop.apache.org/core/docs/current/streaming.html
> https://issues.apache.org/jira/browse/HADOOP-1722
> http://markmail.org/message/24woaqie2a6mrboc
>
> However, I have not got any straight answer.
>
> One recommendation is to put image data on HDFS, but we have to do "hdf
> -get" for each file/dir and process it locally which is every expensive.
>
> Another recommendation is to "...put them in a centralized place where all
> the hadoop nodes can access them (via .e.g, NFS mount)..." Obviously, IO
> will becomes bottleneck and it defeat the purpose of distributed
> processing.
>
> I also notice some enhancement ticket is open for hadoop-core. Is it
> committed to any svn (0.21) branch? can somebody show me an example how to
> take *.jpg files (from HDFS), and process files in a distributed fashion
> using streaming?
>
> Many thanks
>
> -Qiming
> --
> View this message in context:
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Alexandra Alecu


jason hadoop wrote:
> 
> You can decommission the datanode, and then un-decommission it.
> 

Thanks Jason, I went off and figured out what decommissioning a datanode
means, and this looks like a very neat idea.

 Decommissioning requires that the nodes be listed in the file named by
dfs.hosts.exclude. The administrator runs the "dfsadmin -refreshNodes" command. 

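For anyone following along, a rough sketch of the moving parts (property name as
in 0.19; dfs.hosts.exclude in hadoop-site.xml points at a file listing the hosts
to decommission, one per line, and the refresh is the same thing that
"hadoop dfsadmin -refreshNodes" does from the shell):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.tools.DFSAdmin;
    import org.apache.hadoop.util.ToolRunner;

    // Illustrative only: tell the namenode to re-read the include/exclude files
    // after editing the file named by dfs.hosts.exclude.
    public class RefreshNodes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.out.println("exclude file: " + conf.get("dfs.hosts.exclude"));
        int rc = ToolRunner.run(conf, new DFSAdmin(), new String[] { "-refreshNodes" });
        System.exit(rc);
      }
    }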
I will need to do some reconfiguring to be able to do this, as the local storage
has exactly the same path on all my datanodes. Essentially, if I change
dfs.data.dir to take away the path to the local storage, it will take it away
on all the datanodes. Therefore I wonder if this advice uncovers a problem
with my cluster configuration.

When I first installed Hadoop on the cluster, since most settings looked the
same for all nodes, I decided to use the same storage paths everywhere, which
made it easier to put the configuration files in one directory and then
create symlinks from all the Hadoop home folders to this one configuration
directory.

Is this what people usually do, or have I gone in a completely wrong
direction?
-- 
View this message in context: 
http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23544682.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: hadoop streaming binary input / image processing

2009-05-14 Thread Zak Stone
Hi Qiming,

You might consider using Dumbo, which is a Python wrapper for Hadoop
Streaming. The associated typedbytes module makes it easy for
streaming programs to work with binary data:

http://wiki.github.com/klbostee/dumbo
http://wiki.github.com/klbostee/typedbytes
http://dumbotics.com/2009/03/03/indexing-typed-bytes/

If you are using an older version of Hadoop (such as 0.18.3), you will
need to apply the following patches to Hadoop to make typedbytes work:

https://issues.apache.org/jira/browse/HADOOP-1722
https://issues.apache.org/jira/browse/HADOOP-5450

The commands you use to apply the patches might look something like this:

cd 
patch -p0 < HADOOP-1722-branch-0.18.patch
patch -p0 < HADOOP-5450.patch
ant package

The guy who put Dumbo together, Klaas Bosteels, is incredibly helpful,
and he continues to improve this useful project.

Zak


On Thu, May 14, 2009 at 12:39 PM, openresearch
 wrote:
>
> All,
>
> I have read some recommendation regarding image (binary input) processing
> using Hadoop-streaming which only accept text out-of-box for now.
> http://hadoop.apache.org/core/docs/current/streaming.html
> https://issues.apache.org/jira/browse/HADOOP-1722
> http://markmail.org/message/24woaqie2a6mrboc
>
> However, I have not got any straight answer.
>
> One recommendation is to put image data on HDFS, but we have to do "hdf
> -get" for each file/dir and process it locally which is every expensive.
>
> Another recommendation is to "...put them in a centralized place where all
> the hadoop nodes can access them (via .e.g, NFS mount)..." Obviously, IO
> will becomes bottleneck and it defeat the purpose of distributed processing.
>
> I also notice some enhancement ticket is open for hadoop-core. Is it
> committed to any svn (0.21) branch? can somebody show me an example how to
> take *.jpg files (from HDFS), and process files in a distributed fashion
> using streaming?
>
> Many thanks
>
> -Qiming
> --
> View this message in context: 
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


hadoop streaming binary input / image processing

2009-05-14 Thread openresearch

All,

I have read some recommendation regarding image (binary input) processing
using Hadoop-streaming which only accept text out-of-box for now.
http://hadoop.apache.org/core/docs/current/streaming.html
https://issues.apache.org/jira/browse/HADOOP-1722
http://markmail.org/message/24woaqie2a6mrboc

However, I have not got any straight answer.

One recommendation is to put image data on HDFS, but then we have to do "hadoop
dfs -get" for each file/dir and process it locally, which is very expensive.

Another recommendation is to "...put them in a centralized place where all
the hadoop nodes can access them (via e.g. NFS mount)..." Obviously, IO
will become a bottleneck and it defeats the purpose of distributed processing.

I also notice some enhancement tickets are open for hadoop-core. Are they
committed to any svn (0.21) branch? Can somebody show me an example of how to
take *.jpg files (from HDFS) and process the files in a distributed fashion
using streaming?

Many thanks

-Qiming
-- 
View this message in context: 
http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Here is the point in the logs where the infinite loop begins - see time
stamp 2009-05-14 04:03:56,348 : (JobTracker)

2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200905122015_1168_m_29_0' from
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_07_0' to tip
task_200905122015_1183_r_07, for tracker
'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:58504'
2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (cleanup)'attempt_200905122015_1184_m_00_1' to tip
task_200905122015_1184_m_00, for tracker
'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
127.0.0.1
:57741'
2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_200905122015_1182_r_11_0' has completed
task_200905122015_1182_r_11 successfully.
2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (cleanup)'attempt_200905122015_1182_r_10_1' to tip
task_200905122015_1182_r_10, for tracker
'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/
127.0.0.1
:46248'
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_0499_r_04_1
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_200905122015_1184_m_04_1' has completed
task_200905122015_1184_m_04 successfully.
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator:
measured blowup on task_200905122015_1184_m_04 was 20150008/21581175 =
0.93368447269437372009-05-14 04:03:56,331 INFO
org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup =
0.9152383292400812
2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_08_0' to tip
task_200905122015_1183_r_08, for tracker
'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
127.0.0.1:57741'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_09_0' to tip
task_200905122015_1183_r_09, for tracker
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200905122015_1181_r_09_0' from
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_0499_r_04_1
2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_10_0' to tip
task_200905122015_1183_r_10, for tracker
'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/
127.0.0.1:46518'
2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_1070_r_14_1
2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_11_0' to tip
task_200905122015_1183_r_11, for tracker
'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
127.0.0.1:33929'
2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@a4fe4, false, true,
21461) from 10.253.178.95:50709: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown
Source)at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker:
attempt_200905122015_1183_r_07_0 is 24 ms debug.
2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_12_0' to tip
task_200905122015_1183_r_12, for tracker
'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:58504'
2009-05-14 04:03:56,459 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_1070_r_14_1
2009-05-14 04:03:56,459 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status u

Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Just had another cluster crash with the same issue. This is still a huge
issue for us - it is still crashing our cluster every other night (actually
almost every night now).

Should we move to 0.20? Is there more information I can provide?  Is this
related to my other email "Constantly getting DiskErrorExceptions - but
logged as INFO"? I haven't seen responses on that.

Thanks!
Lance
On Thu, May 14, 2009 at 7:48 AM, Lance Riedel  wrote:

> Here is the latest here.. Haven't heard any more, but every other night we
> get 10 gigs logs and tons of failed tasks and have to restart the cluster
>
> -- Forwarded message --
> From: Lance Riedel 
> Date: Fri, May 8, 2009 at 10:49 AM
> Subject: Re: Infinite Loop Resending status from task tracker
> To: core-user@hadoop.apache.org
> Cc: Brian Long 
>
>
> Hi Todd,
> Sorry, my response got hung up in my outbox for a couple of days.. arghh
>
>
>
> Confirmed that 1) we are not running out of space and 2) that our
> mapred.local.dir directory is not in /tmp
>
> Not sure if this an ec2 problem with a mounted drive?
>
> We had the same thing happen again, exact same logs and symptoms
> (simultaneous in jobtracker and tasktracker)
>
> Thinking about moving to .20 because of this, any thoughts on that?
>
> Thanks,
> Lance
>
>
> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:
>
>  Hi Lance,
>>
>> Two thoughts here that might be the culprit:
>>
>> 1) Is it possible that the partition that your mapred.local.dir is on is
>> out
>> of space on that task tracker?
>>
>> 2) Is it possible that you're using a directory under /tmp for
>> mapred.local.dir and some system cron script cleared out /tmp?
>>
>> -Todd
>>
>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel  wrote:
>>
>>  Hi Todd,
>>> Not sure if this is related, but our hadoop cluster in general is getting
>>> more and more unstable.  the logs are full of this error message (but
>>> having
>>> trouble tracking down the root problem):
>>>
>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_/attempt_200904301103__m_01_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_12_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_15_0/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_/attempt_200904301103__m_01_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_12_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_15_0/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_/attempt_200904301103__m_01_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_12_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_15_0/output/file.out
>>> in any of the configured local directories
>>>
>>>
>>> Lance
>>>
>>>
>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>>
>>> Hey Lance,
>>>

 Thanks for the logs. They definitely confirmed my suspicion. There are
 two
 problems here:

 1) If the JobTracker throws an exception during processing of a
 heartbeat,
 the tasktracker retries with no delay, since lastHeartbeat isn't updated
>>

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can decommission the datanode, and then un-decommission it.

On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu
wrote:

>
> Hi,
>
> I want to test how Hadoop and HBase are performing. I have a cluster with 1
> namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.
>
> I first ran a few tests when the 4 datanodes use local storage specified in
> dfs.data.dir.
> Now, I want to see what is the tradeoff if I switch from local storage to
> network mounted storage (I know it sounds like a crazy idea but
> unfortunately I have to explore this possibility).
>
> I would like to be able to change the dfs.data.dir and maybe in two steps
> be
> able to switch to the network mounted storage.
>
> What I had in mind was the following steps :
>
> 0. Assume initial status is a working cluster with local storage, e.g.
> dfs.data.dir set to local_storage_path.
> 1. Stop cluster: bin/stop-dfs
> 2. Change dfs.data.dir by adding the network_storage_path to the local
> storage_path.
> 3. Start cluster: bin/start-dfs (this will format the new network
> locations,
> which is nice)
> 4. <somehow migrate the data from local_storage_path onto the network storage location>
> 5. Stop cluster: bin/stop-dfs
> 6. Change dfs.data.dir parameter to only contain local_storage_path
> 7.  Start cluster and live happily ever after :-).
>
> The problem is , I don;t know if there is a command or an option to achieve
> step 4.
> Do you have any suggestions ?
>
> I found some info on how to add datanodes, but there is not much info on
> how
> to remove safely (without losing data etc) datanodes or storage locations
> on
> a particular node.
> Is this possible?
>
> Many thanks,
> Alexandra.
>
>
>
>
>
> --
> View this message in context:
> http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23542127.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
Sort order is preserved if your Mapper doesn't change the key ordering in
output. Partition name is not preserved.

What I have done is to manually work out what the partition number of the
output file should be for each map task, by calling the partitioner on an
input key, and then renaming the output in the close method.
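Roughly, the partition-number step looks like this (a hypothetical sketch, not
the code from the book; it assumes the job is configured with the same
TotalOrderPartitioner partition file that was used to create the inputs):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

    // Hypothetical helper: which output partition do this task's keys belong to?
    public class PartitionNumberHelper {
      public static String targetPartName(JobConf conf, Text anyInputKey, int numPartitions) {
        TotalOrderPartitioner<Text, Text> partitioner = new TotalOrderPartitioner<Text, Text>();
        partitioner.configure(conf);   // loads the partition file the inputs were built with
        int p = partitioner.getPartition(anyInputKey, null, numPartitions);
        return String.format("part-%05d", p);   // the name the map output should end up with
      }
    }

The rename from the task's default part-NNNNN name to that target name is then
done in close() (or, conceptually, in an OutputCommitter), as described above.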

Conceptually the place for this dance is in the OutputCommitter, but I
haven't used them in production code, and my mapside join examples come from
before they were available.

The Hadoop join framework handles setting the split size to Long.MAX_VALUE
for you.

If you put up a discussion question on www.prohadoopbook.com, I will fill in
the example on how to do this.

On Thu, May 14, 2009 at 8:04 AM, Stuart White wrote:

> I'm implementing a map-side join as described in chapter 8 of "Pro
> Hadoop".  I have two files that have been partitioned using the
> TotalOrderPartitioner on the same key into the same number of
> partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
> one Mapper will handle an entire partition.
>
> I want the output to be written in the same partitioned, total sort
> order.  If possible, I want to accomplish this by setting my
> NumReducers to 0 and having the output of my Mappers written directly
> to HDFS, thereby skipping the partition/sort step.
>
> My question is this: Am I guaranteed that the Mapper that processes
> part-0 will have its output written to the output file named
> part-0, the Mapper that processes part-1 will have its output
> written to part-1, etc... ?
>
> If so, then I can preserve the partitioning/sort order of my input
> files without re-partitioning and re-sorting.
>
> Thanks.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Regarding Capacity Scheduler

2009-05-14 Thread Hemanth Yamijala

Manish,

The pre-emption code in the capacity scheduler was found to require a
thorough relook, and due to the inherent complexity of the problem it is
likely to have issues of the type you have noticed. We have decided to
rework the pre-emption code from scratch and, to this effect, have removed
it from the 0.20 branch to start afresh.


Thanks
Hemanth

Manish Katyal wrote:

I'm experimenting with the Capacity scheduler (0.19.0) in a multi-cluster
environment.
I noticed that unlike the mappers, the reducers are not pre-empted?

I have two queues (high and low) that are each running big jobs (70+ maps
each).  The scheduler splits the mappers as per the queue
guaranteed-capacity (5/8ths for the high and the rest for the low). However,
the reduce jobs are not interleaved -- the reduce job in the high queue is
blocked waiting for the reduce job in the low queue to complete.

Is this a bug or by design?

*Low queue:*
Guaranteed Capacity (%) : 37.5
Guaranteed Capacity Maps : 3
Guaranteed Capacity Reduces : *3*
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 3
Number of Running Reduces : *7*
Number of Waiting Maps : 131
Number of Waiting Reduces : 0
Priority Supported : NO

*High queue:*
Guaranteed Capacity (%) : 62.5
Guaranteed Capacity Maps : 5
Guaranteed Capacity Reduces : 5
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 4
Number of Running Reduces : *0*
Number of Waiting Maps : 68
Number of Waiting Reduces : *7*
Priority Supported : NO

  




Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Alexandra Alecu

Another possibility I am thinking about now, which is suitable for me as I do
not actually have much data stored in the cluster when I want to perform
this switch, is to set the replication level really high and then simply
remove the local storage locations and restart the cluster. With a bit of
luck the high level of replication will allow a full recovery of the cluster
on restart.

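If there really is only a little data, something along these lines could raise
the replication factor before retiring the local directories (a rough sketch;
the -setrep option of the dfs shell does essentially the same job). It would
still be worth letting the namenode finish re-replicating, e.g. by checking
fsck, before actually removing the local storage:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: bump every file's replication factor to the number of
    // datanodes (4 in this cluster), so each node should end up holding a copy.
    public class RaiseReplication {
      static void raise(FileSystem fs, Path dir, short rep) throws Exception {
        for (FileStatus stat : fs.listStatus(dir)) {
          if (stat.isDir()) {
            raise(fs, stat.getPath(), rep);            // recurse into subdirectories
          } else {
            fs.setReplication(stat.getPath(), rep);    // asynchronous; blocks are copied later
          }
        }
      }

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        raise(fs, new Path("/"), (short) 4);
      }
    }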
Is this something that you would advise?

Many thanks,
Alexandra.
-- 
View this message in context: 
http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23542574.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
I'm implementing a map-side join as described in chapter 8 of "Pro
Hadoop".  I have two files that have been partitioned using the
TotalOrderPartitioner on the same key into the same number of
partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
one Mapper will handle an entire partition.

I want the output to be written in the same partitioned, total sort
order.  If possible, I want to accomplish this by setting my
NumReducers to 0 and having the output of my Mappers written directly
to HDFS, thereby skipping the partition/sort step.

My question is this: Am I guaranteed that the Mapper that processes
part-0 will have its output written to the output file named
part-0, the Mapper that processes part-1 will have its output
written to part-1, etc... ?

If so, then I can preserve the partitioning/sort order of my input
files without re-partitioning and re-sorting.

Thanks.


Re: Indexing pdfs and docs

2009-05-14 Thread Piotr Praczyk
Hi

First of all, you should probably know what you want to do exactly. Without
this, it is hard to estimate any hardware requirements.
I assume you want to use Hadoop for some kind of offline calculation whose
results are used for web-based search later?
In your place I would start by reading about how such indexing can work.
Then, when you know what you need and know the algorithms being used, you
can estimate the hardware/software requirements.

If you just want to do the indexing and searching of the documents, you
could probably look at the CDS Invenio project
(http://cdsware.cern.ch/invenio/index.html). It provides such functionality
already.

regards

Piotr


2009/5/14 PORTO aLET 

> Hi,
>
> My company has about 50GB of pdfs and docs, and we would like to be able to
> do some text search over a web interface.
> Is there any good tutorial that specifies hardware requirements and
> software
> specs to do this?
>
> Regards
>


Re: Setting up another machine as secondary node

2009-05-14 Thread jason hadoop
Any machine put in the conf/masters file becomes a secondary namenode.

At some point there was confusion about the safety of running more than one,
which I believe was settled: multiple secondary namenodes are safe.

The secondary namenode takes a snapshot at 5 minute (configurable)
intervals, rebuilds the fsimage and sends that back to the namenode.
There is some performance advantage to having it on the local machine, and
some safety advantage to having it on an alternate machine.
Could someone who remembers speak up on single vs. multiple secondary
namenodes?

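For reference, the checkpoint interval mentioned above is governed by
fs.checkpoint.period (in seconds) and fs.checkpoint.size; a throwaway sketch of
reading them, assuming the usual 0.19-era defaults:

    import org.apache.hadoop.conf.Configuration;

    // Illustrative only: the knobs the secondary namenode checkpointing honours.
    public class CheckpointSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        long periodSecs = conf.getLong("fs.checkpoint.period", 3600);     // seconds between checkpoints
        long editsBytes = conf.getLong("fs.checkpoint.size", 67108864L);  // ...or once edits grow past this
        System.out.println("checkpoint every " + periodSecs
            + " s, or when edits exceed " + editsBytes + " bytes");
      }
    }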

On Thu, May 14, 2009 at 6:07 AM, David Ritch  wrote:

> First of all, the secondary namenode is not a what you might think a
> secondary is - it's not failover device.  It does make a copy of the
> filesystem metadata periodically, and it integrates the edits into the
> image.  It does *not* provide failover.
>
> Second, you specify its IP address in hadoop-site.xml.  This is where you
> can override the defaults set in hadoop-default.xml.
>
> dbr
>
> On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani  >wrote:
>
> > Hi,
> > I wanna set up a cluster of 5 nodes in such a way that
> > node1 - master
> > node2 - secondary namenode
> > node3 - slave
> > node4 - slave
> > node5 - slave
> >
> >
> > How do we go about that?
> > there is no property in hadoop-env where i can set the ip-address for
> > secondary name node.
> >
> > if i set node-1 and node-2 in masters, and when we start dfs, in both the
> > m/cs, the namenode n secondary namenode processes r present. but i think
> > only node1 is active.
> > n my namenode fail over operation fails.
> >
> > ny suggesstions?
> >
> > Regards,
> > Rakhi
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Alexandra Alecu

Hi, 

I want to test how Hadoop and HBase are performing. I have a cluster with 1
namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2. 

I first ran a few tests when the 4 datanodes use local storage specified in
dfs.data.dir.
Now, I want to see what the tradeoff is if I switch from local storage to
network-mounted storage (I know it sounds like a crazy idea, but
unfortunately I have to explore this possibility).

I would like to be able to change the dfs.data.dir and maybe in two steps be
able to switch to the network mounted storage.

What I had in mind was the following steps : 

0. Assume initial status is a working cluster with local storage, e.g.
dfs.data.dir set to local_storage_path.
1. Stop cluster: bin/stop-dfs
2. Change dfs.data.dir by adding the network_storage_path to the local
storage_path.
3. Start cluster: bin/start-dfs (this will format the new network locations,
which is nice)
4. <somehow migrate the data from local_storage_path onto the network storage location>
5. Stop cluster: bin/stop-dfs
6. Change dfs.data.dir parameter to only contain local_storage_path
7.  Start cluster and live happily ever after :-).

The problem is, I don't know if there is a command or an option to achieve
step 4.
Do you have any suggestions ?

I found some info on how to add datanodes, but there is not much info on how
to safely remove datanodes or storage locations on a particular node (without
losing data, etc.).
Is this possible?

Many thanks,
Alexandra.





-- 
View this message in context: 
http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23542127.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma  wrote:
> The ec2 documentation point to the use of public 'ip' addresses - whereas 
> using public 'hostnames' seems safe since it resolves to internal addresses 
> from within the cluster (and resolve to public ip addresses from outside).
>
> The only data transfer that I would incur while submitting jobs from outside 
> is the cost of copying the jar files and any other files meant for the 
> distributed cache). That would be extremely small.
>
>
> -Original Message-
> From: Tom White [mailto:t...@cloudera.com]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> Hi Joydeep,
>
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
>
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
>
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
>
> Cheers,
> Tom
>
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma  
> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public 
>> hostname (instead of the private hostname).
>>
>> Now I can submit jobs remotely via the socks proxy (the problem below is 
>> resolved) - but the map tasks fail with an exception:
>>
>>
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
>> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
>> Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
>> running child
>> java.io.IOException: Call to 
>> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
>> local exception: Connection refused
>>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy1.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>>        at 
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:177)
>>        at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>>        at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>>
>>
>> strangely enough - job submissions from nodes within the ec2 cluster work 
>> just fine. I looked at the job.xml files of jobs submitted locally and 
>> remotely and don't see any relevant differences.
>>
>> Totally foxed now.
>>
>> Joydeep
>>
>> -Original Message-
>> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
>> Sent: Wednesday, May 13, 2009 9:38 PM
>> To: core-user@hadoop.apache.org
>> Cc: Tom White
>> Subject: RE: public IP for datanode on EC2
>>
>> Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
>> dfs command line operations work just fine.
>>
>> However - I am hitting a new error during job submission (running 
>> hadoop-0.19.0):
>>
>> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
>> (SessionState.java:printError(279)) - Job Submission failed with exception 
>> 'java.net.UnknownHostException(unknown host: 
>> domU-12-31-39-00-51-94.compute-1.internal)'
>> java.net.UnknownHostException: unknown host: 
>> domU-12-31-39-00-51-94.compute-1.internal
>>        at org.apache.hadoop.ipc.Client$Connection.(Client.java:195)
>>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy0.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>>        at 
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:176)
>>        at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>>        at 
>> org.ap

Indexing pdfs and docs

2009-05-14 Thread PORTO aLET
Hi,

My company has about 50GB of pdfs and docs, and we would like to be able to
do some text search over a web interface.
Is there any good tutorial that specifies hardware requirements and software
specs to do this?

Regards


RE: public IP for datanode on EC2

2009-05-14 Thread Joydeep Sen Sarma
The ec2 documentation points to the use of public 'ip' addresses - whereas using
public 'hostnames' seems safe, since they resolve to internal addresses from
within the cluster (and to public IP addresses from outside).

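A quick way to see that split-horizon behaviour is to resolve the public
hostname from both places (a throwaway sketch; the hostname is just the one from
the stack trace quoted below):

    import java.net.InetAddress;

    // Illustrative only: run this inside EC2 and it should print the internal
    // 10.x address; run it from outside and it should print the public address.
    public class ResolveCheck {
      public static void main(String[] args) throws Exception {
        String host = (args.length > 0) ? args[0]
            : "ec2-75-101-199-45.compute-1.amazonaws.com";
        System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
      }
    }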
The only data transfer that I would incur while submitting jobs from outside is
the cost of copying the jar files (and any other files meant for the distributed
cache). That would be extremely small.


-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Thursday, May 14, 2009 5:58 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma  wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public 
> hostname (instead of the private hostname).
>
> Now I can submit jobs remotely via the socks proxy (the problem below is 
> resolved) - but the map tasks fail with an exception:
>
>
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
> Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> running child
> java.io.IOException: Call to 
> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
> local exception: Connection refused
>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:177)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>
>
> strangely enough - job submissions from nodes within the ec2 cluster work 
> just fine. I looked at the job.xml files of jobs submitted locally and 
> remotely and don't see any relevant differences.
>
> Totally foxed now.
>
> Joydeep
>
> -Original Message-
> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: core-user@hadoop.apache.org
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
>
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
> dfs command line operations work just fine.
>
> However - I am hitting a new error during job submission (running 
> hadoop-0.19.0):
>
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
> (SessionState.java:printError(279)) - Job Submission failed with exception 
> 'java.net.UnknownHostException(unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal
>        at org.apache.hadoop.ipc.Client$Connection.(Client.java:195)
>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy0.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:176)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apa

Re: hadoop getProtocolVersion and getBuildVersion error

2009-05-14 Thread Abhishek Verma
Hi Starry,

I noticed the same problem when I copied hadoop-metrics.properties from my
old hadoop-0.19 conf along with the other files. Make sure you are using the
right version of the conf files.

Hope that helps.

-Abhishek.
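
For what it's worth, the "Duplicate metricsName" lines in the quoted log below come from the
metrics registry refusing to register the same metric name twice; note the server logs them at
INFO and keeps running. A toy sketch of the mechanism the stack trace shows (constructor shapes
are assumed from the 0.20 metrics API, so treat this as illustrative only):

import org.apache.hadoop.metrics.util.MetricsRegistry;
import org.apache.hadoop.metrics.util.MetricsTimeVaryingRate;

// Illustrative only: registering two metrics with the same name on one registry,
// which is what the quoted trace shows RPC$Server.call doing for getProtocolVersion.
public class DuplicateMetricDemo {
  public static void main(String[] args) {
    MetricsRegistry registry = new MetricsRegistry();
    new MetricsTimeVaryingRate("getProtocolVersion", registry);
    // The second registration throws IllegalArgumentException
    // "Duplicate metricsName:getProtocolVersion", as in the log below.
    new MetricsTimeVaryingRate("getProtocolVersion", registry);
  }
}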

On Thu, May 14, 2009 at 7:48 AM, Starry SHI  wrote:

> Has nobody encountered these problems: "Error
> register getProtocolVersion" and "Error
> register getBuildVersion"?
>
> Starry
>
> /* Tomorrow is another day. So is today. */
>
>
>
> On Tue, May 12, 2009 at 13:27, Starry SHI  wrote:
> > Hi, all. Today I noticed that my hadoop cluster (r0.20.0+jdk1.6) threw
> > some errors in RPC handling. Below is part of the content of namenode
> > log file:
> >
> > 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> > register getProtocolVersion
> > java.lang.IllegalArgumentException: Duplicate
> metricsName:getProtocolVersion
> >at
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
> >at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
> >at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> > 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> > register getProtocolVersion
> > java.lang.IllegalArgumentException: Duplicate
> metricsName:getProtocolVersion
> >at
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
> >at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
> >at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> > 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> > register getProtocolVersion
> > java.lang.IllegalArgumentException: Duplicate
> metricsName:getProtocolVersion
> >at
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
> >at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
> >at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> > 2009-05-12 10:27:08,276 INFO org.apache.hadoop.ipc.Server: Error
> > register getBuildVersion
> > java.lang.IllegalArgumentException: Duplicate metricsName:getBuildVersion
> >at
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
> >at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>at
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
> >at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> >
> > I noticed that a similar error appears in the log on every datanode.
> > Can anybody tell me how to fix this?
> >
> > I have patched this:
> > https://issues.apache.org/jira/browse/HADOOP-5139, but the error still
> > exists. I really don't know what to do and am hoping for your help!
> >
> > Best regards,
> > Starry
> >
> > /* Tomorrow is another day. So is today. */
> >
>


Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
Search this list for that variable name. I made a post last week
inquiring about append() and was given enough information to go hunt
down the info on Google and JIRA.

On Thu, May 14, 2009 at 2:01 PM, Vishal Ghawate
 wrote:
> where did you find that property


Re: Setting up another machine as secondary node

2009-05-14 Thread David Ritch
First of all, the secondary namenode is not what you might think a
secondary is - it is not a failover device.  It does make a copy of the
filesystem metadata periodically, and it integrates the edits into the
image.  It does *not* provide failover.

Second, you specify its IP address in hadoop-site.xml.  This is where you
can override the defaults set in hadoop-default.xml.
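
For reference, which machine actually runs the secondary is taken from conf/masters (the stock
start-dfs.sh in 0.19/0.20 starts a secondarynamenode on every host listed there); the
checkpoint-related keys then go into that machine's hadoop-site.xml. A rough sketch of those keys
through the Configuration API - host names and paths are placeholders, key names are the
0.19/0.20 defaults:

import org.apache.hadoop.conf.Configuration;

// Rough sketch of the overrides node2 (the checkpoint host) would carry in its
// hadoop-site.xml, expressed via the Configuration API so the key names are explicit.
public class SecondaryNameNodeConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // HTTP address of the *primary* namenode, which the secondary polls for image+edits:
    conf.set("dfs.http.address", "node1:50070");
    // Where and how often the merged checkpoint is written on the secondary:
    conf.set("fs.checkpoint.dir", "/hadoop/dfs/namesecondary");
    conf.set("fs.checkpoint.period", "3600");   // seconds
    for (String key : new String[] {
        "dfs.http.address", "fs.checkpoint.dir", "fs.checkpoint.period" }) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}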

dbr

On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani wrote:

> Hi,
> I wanna set up a cluster of 5 nodes in such a way that
> node1 - master
> node2 - secondary namenode
> node3 - slave
> node4 - slave
> node5 - slave
>
>
> How do we go about that?
> There is no property in hadoop-env where I can set the IP address for the
> secondary namenode.
>
> If I set node1 and node2 in masters, then when we start DFS both machines
> run the namenode and secondary namenode processes, but I think only node1
> is active, and my namenode failover operation fails.
>
> Any suggestions?
>
> Regards,
> Rakhi
>


RE: Append in Hadoop

2009-05-14 Thread Vishal Ghawate
Where did you find that property?

Vishal S. Ghawate

From: Sasha Dolgy [sdo...@gmail.com]
Sent: Thursday, May 14, 2009 6:09 PM
To: core-user@hadoop.apache.org
Subject: Re: Append in Hadoop

yep, i'm using it in 0.19.1 and have used it in 0.20.0

-sasha

On Thu, May 14, 2009 at 1:35 PM, Vishal Ghawate
 wrote:
> Is this property available in 0.20.0? I don't
> think it is there in prior versions.
> Vishal S. Ghawate
> 
> From: Sasha Dolgy [sdo...@gmail.com]
> Sent: Thursday, May 14, 2009 6:03 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Append in Hadoop
>
> it's available, although not suitable for production purposes as i've
> found / been told.
>
> put the following in your $HADOOP_HOME/conf/hadoop-site.xml
>
> <property>
>   <name>dfs.support.append</name>
>   <value>true</value>
> </property>
>
>
> -sd
>
> On Thu, May 14, 2009 at 1:27 PM, Wasim Bari  wrote:
>> Hi,
>> Can someone tell about Append functionality in Hadoop. Is it available 
>> now in 0.20 ??
>>
>> Regards,
>>
>> Wasim
>
>
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>



--
Sasha Dolgy
sasha.do...@gmail.com



Setting up another machine as secondary node

2009-05-14 Thread Rakhi Khatwani
Hi,
 I wanna set up a cluster of 5 nodes in such a way that
node1 - master
node2 - secondary namenode
node3 - slave
node4 - slave
node5 - slave


How do we go about that?
There is no property in hadoop-env where I can set the IP address for the
secondary namenode.

If I set node1 and node2 in masters, then when we start DFS both machines
run the namenode and secondary namenode processes, but I think only node1
is active, and my namenode failover operation fails.

Any suggestions?

Regards,
Rakhi


Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma  wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public 
> hostname (instead of the private hostname).
>
> Now I can submit jobs remotely via the socks proxy (the problem below is 
> resolved) - but the map tasks fail with an exception:
>
>
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
> Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> running child
> java.io.IOException: Call to 
> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
> local exception: Connection refused
>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>
>
> strangely enough - job submissions from nodes within the ec2 cluster work 
> just fine. I looked at the job.xml files of jobs submitted locally and 
> remotely and don't see any relevant differences.
>
> Totally foxed now.
>
> Joydeep
>
> -Original Message-
> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: core-user@hadoop.apache.org
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
>
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
> dfs command line operations work just fine.
>
> However - I am hitting a new error during job submission (running 
> hadoop-0.19.0):
>
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
> (SessionState.java:printError(279)) - Job Submission failed with exception 
> 'java.net.UnknownHostException(unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal
>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy0.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>        at 
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>
>
> looking at the stack trace and the code - it seems that this is happening 
> because the jobclient asks for the mapred system directory from the 
> jobtracker - which replies back with a path name that's qualified against the 
> fs.default.name

Re: hadoop getProtocolVersion and getBuildVersion error

2009-05-14 Thread Starry SHI
Has nobody encountered these problems: "Error
register getProtocolVersion" and "Error
register getBuildVersion"?

Starry

/* Tomorrow is another day. So is today. */



On Tue, May 12, 2009 at 13:27, Starry SHI  wrote:
> Hi, all. Today I noticed that my hadoop cluster (r0.20.0+jdk1.6) threw
> some errors in RPC handling. Below is part of the content of namenode
> log file:
>
> 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> register getProtocolVersion
> java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
>        at 
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> register getProtocolVersion
> java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
>        at 
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> 2009-05-12 10:27:08,200 INFO org.apache.hadoop.ipc.Server: Error
> register getProtocolVersion
> java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
>        at 
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> 2009-05-12 10:27:08,276 INFO org.apache.hadoop.ipc.Server: Error
> register getBuildVersion
> java.lang.IllegalArgumentException: Duplicate metricsName:getBuildVersion
>        at 
> org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:56)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
>        at 
> org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>
> I noticed that a similar error appears in the log on every datanode.
> Can anybody tell me how to fix this?
>
> I have patched this:
> https://issues.apache.org/jira/browse/HADOOP-5139, but the error still
> exists. I really don't know what to do and am hoping for your help!
>
> Best regards,
> Starry
>
> /* Tomorrow is another day. So is today. */
>


Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
yep, i'm using it in 0.19.1 and have used it in 0.20.0

-sasha

On Thu, May 14, 2009 at 1:35 PM, Vishal Ghawate
 wrote:
> Is this property available in 0.20.0? I don't
> think it is there in prior versions.
> Vishal S. Ghawate
> 
> From: Sasha Dolgy [sdo...@gmail.com]
> Sent: Thursday, May 14, 2009 6:03 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Append in Hadoop
>
> it's available, although not suitable for production purposes as i've
> found / been told.
>
> put the following in your $HADOOP_HOME/conf/hadoop-site.xml
>
> <property>
>   <name>dfs.support.append</name>
>   <value>true</value>
> </property>
>
>
> -sd
>
> On Thu, May 14, 2009 at 1:27 PM, Wasim Bari  wrote:
>> Hi,
>>     Can someone tell about Append functionality in Hadoop. Is it available 
>> now in 0.20 ??
>>
>> Regards,
>>
>> Wasim
>
>
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


RE: public IP for datanode on EC2

2009-05-14 Thread Joydeep Sen Sarma
I changed the ec2 scripts to have fs.default.name assigned to the public 
hostname (instead of the private hostname).

Now I can submit jobs remotely via the socks proxy (the problem below is 
resolved) - but the map tasks fail with an exception:


2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already 
tried 9 time(s).
2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Call to 
ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local 
exception: Connection refused
at org.apache.hadoop.ipc.Client.call(Client.java:699)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.hadoop.mapred.Child.main(Child.java:153)


strangely enough - job submissions from nodes within the ec2 cluster work just 
fine. I looked at the job.xml files of jobs submitted locally and remotely and 
don't see any relevant differences.

Totally foxed now.

Joydeep

-Original Message-
From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] 
Sent: Wednesday, May 13, 2009 9:38 PM
To: core-user@hadoop.apache.org
Cc: Tom White
Subject: RE: public IP for datanode on EC2

Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs 
command line operations work just fine.
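
As a rough sketch of the client-side settings the SOCKS-proxy setup described above relies on
(the two hadoop.* keys are assumed from the 0.19/0.20 defaults; the host name, the RPC port
50001, the jobtracker port and the proxy address are placeholders taken from this thread):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a client outside EC2 reaching the cluster through a local SOCKS proxy
// (e.g. one opened with "ssh -D 6666 <ec2 master>").
public class RemoteDfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory");
    conf.set("hadoop.socks.server", "localhost:6666");
    // Public hostname of the master, as discussed above:
    String nameNode = "hdfs://ec2-75-101-199-45.compute-1.amazonaws.com:50001";
    conf.set("fs.default.name", nameNode);
    // Placeholder port; for job submission the jobtracker address matters as well:
    conf.set("mapred.job.tracker",
             "ec2-75-101-199-45.compute-1.amazonaws.com:50002");

    FileSystem fs = FileSystem.get(URI.create(nameNode), conf);
    for (FileStatus stat : fs.listStatus(new Path("/"))) {
      System.out.println(stat.getPath());
    }
    fs.close();
  }
}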

However - I am hitting a new error during job submission (running 
hadoop-0.19.0):

2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
(SessionState.java:printError(279)) - Job Submission failed with exception 
'java.net.UnknownHostException(unknown host: 
domU-12-31-39-00-51-94.compute-1.internal)'
java.net.UnknownHostException: unknown host: 
domU-12-31-39-00-51-94.compute-1.internal
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
at org.apache.hadoop.ipc.Client.call(Client.java:686)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)


looking at the stack trace and the code - it seems that this is happening 
because the jobclient asks for the mapred system directory from the jobtracker 
- which replies back with a path name that's qualified against the 
fs.default.name setting of the jobtracker. Unfortunately the standard EC2 
scripts assign this to the internal hostname of the hadoop master.

Is there any downside to using public hostnames instead of the private ones in 
the ec2 starter scripts?

Thanks for the help,

Joydeep


-Original Message-
From: Philip Zeyliger [mailto:phi...@cloudera.com] 
Sent: Wednesday, May 13, 2009 2:40 PM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma  wrote:
> (raking up real old thread)
>
> After struggling with this issue for sometime now - it seems that accessing 
> hdfs on ec2 from outside ec2 is not possible.
>
> This is primarily because of 
> https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are 
> authorized in ec2 and we set the public hostname via slave.host.name - the 
> namenode uses the internal IP address of the datanodes for block locations. 
> DFS clients outside ec2 cannot reach these addresse

RE: Append in Hadoop

2009-05-14 Thread Vishal Ghawate
Is this property available in 0.20.0? I don't
think it is there in prior versions.
Vishal S. Ghawate

From: Sasha Dolgy [sdo...@gmail.com]
Sent: Thursday, May 14, 2009 6:03 PM
To: core-user@hadoop.apache.org
Subject: Re: Append in Hadoop

it's available, although not suitable for production purposes as i've
found / been told.

put the following in your $HADOOP_HOME/conf/hadoop-site.xml


<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>



-sd

On Thu, May 14, 2009 at 1:27 PM, Wasim Bari  wrote:
> Hi,
> Can someone tell about Append functionality in Hadoop. Is it available 
> now in 0.20 ??
>
> Regards,
>
> Wasim



--
Sasha Dolgy
sasha.do...@gmail.com



Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
it's available, although not suitable for production purposes as i've
found / been told.

put the following in your $HADOOP_HOME/conf/hadoop-site.xml


<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
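
A rough sketch of what the flag gates on the client side, assuming the 0.19/0.20 FileSystem API
(the path is a placeholder, and as noted above this is not considered production-ready):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: appends to an existing HDFS file once dfs.support.append is enabled
// on the cluster. The path is a placeholder.
public class AppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.support.append", true);   // must match the cluster-side setting
    FileSystem fs = FileSystem.get(conf);

    Path log = new Path("/tmp/append-demo.log");
    if (!fs.exists(log)) {
      fs.create(log).close();                      // append needs an existing file
    }
    FSDataOutputStream out = fs.append(log);       // throws if append is not enabled
    out.writeBytes("another line\n");
    out.close();
    fs.close();
  }
}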



-sd

On Thu, May 14, 2009 at 1:27 PM, Wasim Bari  wrote:
> Hi,
>     Can someone tell about Append functionality in Hadoop. Is it available 
> now in 0.20 ??
>
> Regards,
>
> Wasim



-- 
Sasha Dolgy
sasha.do...@gmail.com


Append in Hadoop

2009-05-14 Thread Wasim Bari
Hi,
 Can someone tell me about the Append functionality in Hadoop? Is it available now
in 0.20?

Regards,

Wasim

Re: How to do load control of MapReduce

2009-05-14 Thread zsongbo
We find the disk I/O is the major bottleneck.
Device:    rrqm/s   wrqm/s     r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda          1.00     0.00   85.21     0.00 20926.32     0.00   245.58    31.59  364.49  11.77 100.28
sdb          5.76  4752.88   53.13   131.08 10145.36 39206.02   267.91   168.34  857.96   5.44 100.28
dm-0         0.00     0.00    5.26     7.52    78.20    60.15    10.82     5.60  461.24  78.31 100.10
dm-1         0.00     0.00  146.12  4875.94 32617.54 39007.52    14.26  5498.79 1021.17   0.20 100.28
dm-2         0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
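
When disk I/O is the bottleneck like this, the usual first knob is capping concurrent tasks per
node. A hedged sketch of the relevant keys (names as in the 0.19/0.20 defaults); note they are
read by each TaskTracker from its own hadoop-site.xml at startup, so setting them per-job has
no effect:

import org.apache.hadoop.conf.Configuration;

// Sketch: print the per-node task caps that bound how many map/reduce tasks
// (and hence how much concurrent disk I/O) a single TaskTracker will run.
public class TaskSlotSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    System.out.println("mapred.tasktracker.map.tasks.maximum    = "
        + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
        + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
  }
}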


On Wed, May 13, 2009 at 12:01 AM, Steve Loughran  wrote:

> Stefan Will wrote:
>
>> Yes, I think the JVM uses way more memory than just its heap. Now some of
>> it
>> might be just reserved memory, but not actually used (not sure how to tell
>> the difference). There are also things like thread stacks, jit compiler
>> cache, direct nio byte buffers etc. that take up process space outside of
>> the Java heap. But none of that should imho add up to Gigabytes...
>>
>
> good article on this
> http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/
>
>


managing hadoop using moab

2009-05-14 Thread Vishal Ghawate


Hi,
I just wonder if we can use the Moab cluster suite for managing a Hadoop cluster.
Vishal S. Ghawate


Re: Regarding Capacity Scheduler

2009-05-14 Thread Billy Pearson



I am seeing the same problem posted on the list on the 11th and have not seen
any reply.


Billy


- Original Message - 
From: "Manish Katyal" 


Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: 
Sent: Wednesday, May 13, 2009 11:48 AM
Subject: Regarding Capacity Scheduler



I'm experimenting with the Capacity scheduler (0.19.0) in a multi-cluster
environment.
I noticed that unlike the mappers, the reducers are not pre-empted?

I have two queues (high and low) that are each running big jobs (70+ maps
each).  The scheduler splits the mappers as per the queue
guaranteed-capacity (5/8ths for the high and the rest for the low). 
However,

the reduce jobs are not interleaved -- the reduce job in the high queue is
blocked waiting for the reduce job in the low queue to complete.

Is this a bug or by design?

*Low queue:*
Guaranteed Capacity (%) : 37.5
Guaranteed Capacity Maps : 3
Guaranteed Capacity Reduces : *3*
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 3
Number of Running Reduces : *7*
Number of Waiting Maps : 131
Number of Waiting Reduces : 0
Priority Supported : NO

*High queue:*
Guaranteed Capacity (%) : 62.5
Guaranteed Capacity Maps : 5
Guaranteed Capacity Reduces : 5
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 4
Number of Running Reduces : *0*
Number of Waiting Maps : 68
Number of Waiting Reduces : *7*
Priority Supported : NO






Re: how to connect to remote hadoop dfs by eclipse plugin?

2009-05-14 Thread Rasit OZDAS
Why don't you use it with localhost? Does that have a disadvantage?
As far as I know, there were several host <=> IP problems in Hadoop, but
that was a while ago; I think these should have been solved by now.

It can also be about the order of IP mappings in the hosts (IP table) file.
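
One way to take the plugin out of the picture is a tiny standalone client run from the same
remote machine; if it also loops on connection retries, the namenode is probably only listening
on 127.0.0.1 (worth checking fs.default.name in the cluster's config and the master's hosts
file). A sketch with a placeholder IP and port:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal connectivity check against a remote namenode, independent of Eclipse.
public class DfsPing {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://192.168.0.10:9100";   // placeholder: namenode IP and port
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    for (FileStatus stat : fs.listStatus(new Path("/"))) {
      System.out.println(stat.getPath());
    }
    fs.close();
  }
}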

2009/5/14 andy2005cst 

>
> When I set the IP to localhost it works well, but if I change localhost into the
> IP address, it does not work at all.
> That is to say, my Hadoop is OK; it is just the connection that fails.
>
>
> Rasit OZDAS wrote:
> >
> > Your hadoop isn't working at all or isn't working at the specified port.
> > - try stop-all.sh command on namenode. if it says "no namenode to stop",
> > then take a look at namenode logs and paste here if anything seems
> > strange.
> > - If namenode logs are ok (filled with INFO messages), then take a look
> at
> > all logs.
> > - In eclipse plugin, left side is for map reduce port, right side is for
> > namenode port, make sure both are same as your configuration in xml files
> >
> > 2009/5/12 andy2005cst 
> >
> >>
> >> When I use the Eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to
> >> connect to a remote Hadoop DFS, I get an IOException. If I run a map/reduce
> >> program it outputs:
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 0 time(s).
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 1 time(s).
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 2 time(s).
> >> 
> >> Exception in thread "main" java.io.IOException: Call to
> /**.**.**.**:9100
> >> failed on local exception: java.net.SocketException: Connection refused:
> >> connect
> >>
> >> looking forward your help. thanks a lot.
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23498736.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23533748.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ