RE: EOFException while starting name node

2008-08-05 Thread Wanjari, Amol
Thanks. It worked. 

Amol

-Original Message-
From: lohit [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 04, 2008 10:20 PM
To: core-user@hadoop.apache.org
Subject: Re: EOFException while starting name node

We have seen a similar exception reported earlier by others on the list.
What you might want to try is to use a hex editor or equivalent to open
up 'edits' and get rid of the last record. In these cases the last record
is usually incomplete, which is why your namenode is not starting. Once you
have updated your edits, start the namenode and run 'hadoop fsck /' to see if you
have any corrupt files, and fix or get rid of them.
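(Editorial note, not from the original mail: one way to do the same thing without hand-editing bytes, once a hex editor has told you the byte offset at which the last complete record ends, is to truncate the file programmatically. The path and offset below are placeholders; this is only a rough sketch, and you should work on a copy.)

import java.io.IOException;
import java.io.RandomAccessFile;

// Rough sketch: truncate the edits log right after the last complete record,
// dropping the partial record at the end. Path and offset are placeholders;
// find the real offset with a hex editor first.
public class TruncateEdits {
    public static void main(String[] args) throws IOException {
        String editsPath = "/path/to/dfs/name/current/edits"; // placeholder
        long endOfLastCompleteRecord = 123456L;                // placeholder offset
        RandomAccessFile raf = new RandomAccessFile(editsPath, "rw");
        try {
            raf.setLength(endOfLastCompleteRecord);
        } finally {
            raf.close();
        }
    }
}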
PS: Take a backup of dfs.name.dir before updating and playing around
with it.

Thanks,
Lohit



- Original Message 
From: steph [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Monday, August 4, 2008 8:31:07 AM
Subject: Re: EOFException while starting name node


2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping  
server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode:  
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)


Actually my exception is slightly different from yours. Maybe moving the
edits file and recreating a new one will work for you.


On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

 I'm getting the following exceptions while starting the name node -

 ERROR dfs.NameNode: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
 org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
at
 org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
at
 org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at
 org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
at
 org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
at
 org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

 Is there a way to recover the name node without losing any data?

 Thanks,
 Amol


Re: MultiFileInputFormat and gzipped files

2008-08-05 Thread Enis Soztutar
MultiFileWordCount uses its own RecordReader, namely
MultiFileLineRecordReader. This is different from LineRecordReader,
which automatically detects the file's codec and decodes it.

You can write a custom RecordReader similar to LineRecordReader and
MultiFileLineRecordReader, or just add codec support to MultiFileLineRecordReader.
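(Editorial sketch, not from the original mail: the key piece such a record reader needs is codec detection when each file of the split is opened. A minimal helper along these lines -- the class and method names here are my own, and it has not been tested against 0.17.1's exact API:)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Open a file for line-by-line reading, transparently decompressing it when
// CompressionCodecFactory recognizes the extension (e.g. ".gz").
public class CodecAwareOpener {
    public static BufferedReader open(Path file, Configuration conf) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        InputStream in = fs.open(file);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec != null) {
            in = codec.createInputStream(in); // decode .gz or any other registered codec
        }
        return new BufferedReader(new InputStreamReader(in));
    }
}

A MultiFileLineRecordReader-style reader would call something like this for each path in its MultiFileSplit instead of opening the raw stream directly.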



Michele Catasta wrote:

Hi all,

I'm writing some Hadoop jobs that should run on a collection of
gzipped files. Everything is already working correctly with
MultiFileInputFormat and an initial step of gunzip extraction.
Considering that Hadoop recognizes and correctly handles .gz files (at
least with a single-file input), I was wondering whether it can do
the same with file collections, so that I avoid the overhead of
sequential file extraction.
I tried to run the multi-file WordCount example with a bunch of
gzipped text files (0.17.1 installation), and I get wrong output
(neither correct nor empty). With my own InputFormat (not really
different from the one in multifilewc), I got no output at all (map
input record counter = 0).

Is this the desired behavior? Are there technical reasons why it does
not work in a multi-file scenario?
Thanks in advance for the help.


Regards,
  Michele Catasta

  




libhdfs and multithreaded applications

2008-08-05 Thread Leon Mergen
Hello,

At the libhdfs wiki http://wiki.apache.org/hadoop/LibHDFS#Threading I read
this:

libhdfs can be used in threaded applications using the Posix Threads.
However
to carefully interact with JNI's global/local references the user has to
explicitly call
the *hdfsConvertToGlobalRef* / *hdfsDeleteGlobalRef* apis.

I cannot seem to find any reference to these functions anywhere in the
hadoop-0.17.1 source base. Are these functions deprecated, so that
multi-threaded applications are supported out of the box, or has
something else changed?

Regards,

Leon Mergen


log4j problems in hadoop-0.17.1

2008-08-05 Thread wangxiaowei
Dear All,
  I find that the file conf/log4j.properties specifies three appenders:
ConsoleAppender, DailyRollingFileAppender and TaskLogAppender. I can tell from the
output locations that the JobClient's output target is ConsoleAppender, that the
JobTracker, TaskTracker, NameNode and DataNode all target
DailyRollingFileAppender, and that TaskLogAppender is the target of every task's
output, whether map task or reduce task.
But the problem is that the configuration file only sets
log4j.rootLogger = INFO,console. The other two appenders have no
corresponding logger, so how is the relationship between
DailyRollingFileAppender and, for example, the JobTracker established? Where can I
find it: in source code, a script file or a configuration file? I ask because I want
to add some logs to my program with log4j.
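(Editorial note, not from the original mail: if I recall the scripts correctly, the daemon start-up scripts normally select the file and task appenders by overriding the hadoop.root.logger system property on the command line, which is why conf/log4j.properties by itself only wires up the console appender. For adding your own log statements, a minimal sketch is to log through Apache Commons Logging just as Hadoop's own classes do; log4j then routes the messages to whatever appender the process was started with:)

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Sketch: log from your own job code the same way Hadoop's classes do.
public class MyJobLogging {
    private static final Log LOG = LogFactory.getLog(MyJobLogging.class);

    public static void main(String[] args) {
        LOG.info("an informational message from my program");
        LOG.warn("a warning message");
    }
}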

Thanks Very Much!

Linux server clustered HDFS: access from Windows eclipse Java application

2008-08-05 Thread Alberto Forcén
Hi all.

I'm running a clustered HDFS on Linux and I need to access files (I/O) from an
Eclipse Java application running on Windows. It seems simple, but is it
possible?

I have written code using the API, but I have a problem: when the code invokes the
DistributedFileSystem.initialize() method I receive an exception:
java.net.SocketTimeoutException


[code]
String ipStr = "192.168.75.191";
String portStr = "9000";
String uriStr = "http://" + ipStr + ":" + portStr;

Configuration conf = new Configuration();
conf.set("hadoop.job.ugi", "user,group"); // User and the groups it belongs to

DistributedFileSystem dfs = new DistributedFileSystem();
dfs.initialize(new URI(uriStr), conf);
[/code]

[trace]
Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178)
at 
org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
at examples.HadoopDFS.main(HadoopDFS.java:153)
[/trace]
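(Editorial sketch, not from the thread: in recent Hadoop releases the usual way to get an HDFS handle from client code is FileSystem.get() with an hdfs:// URI rather than instantiating DistributedFileSystem directly. The host and port below are placeholders and must match fs.default.name on the cluster; this only illustrates the API, it is not a claimed fix for the timeout.)

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: connect through the generic FileSystem factory and do a trivial check.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.75.191:9000"), conf);
        System.out.println("connected to " + fs.getUri()
                + ", '/' exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}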



Re: Linux server clustered HDFS: access from Windows eclipse Java application

2008-08-05 Thread Qin Gao
I think IBM has a plugin that can access HDFS. I don't know whether it
contains source code, but maybe it helps.

www.alphaworks.ibm.com/tech/mapreducetools


On Tue, Aug 5, 2008 at 5:16 AM, Alberto Forcén [EMAIL PROTECTED] wrote:

 Hi all.

 I'm running a clustering HDFS on linux and I need to access files (I/O)
 from eclipse Java application running on Windows. It seems simple, but is it
 possible?

 I have write code using API but I have a problem: when code invokes
 DistributedFileSystem.initialize() method I receive an exception:
 java.net.SocketTimeoutException


 [code]
 String ipStr = "192.168.75.191";
 String portStr = "9000";
 String uriStr = "http://" + ipStr + ":" + portStr;

 Configuration conf = new Configuration();
 conf.set("hadoop.job.ugi", "user,group"); // User and the groups it belongs to

 DistributedFileSystem dfs = new DistributedFileSystem();
 dfs.initialize(new URI(uriStr), conf);
 [/code]

 [trace]
 Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
 at org.apache.hadoop.ipc.Client.call(Client.java:559)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
 at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
 at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
 at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178)
 at
 org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
 at examples.HadoopDFS.main(HadoopDFS.java:153)
 [/trace]





Re: having different HADOOP_HOME for master and slaves?

2008-08-05 Thread Meng Mao
Is there any way for me to log and find out why the NameNode process is not
launching on the master?

On Mon, Aug 4, 2008 at 8:19 PM, Meng Mao [EMAIL PROTECTED] wrote:

 assumption -- if I run stop-all.sh _successfully_ on a Hadoop deployment
 (which means every node in the grid is using the same path to Hadoop), then
 that Hadoop installation becomes invisible, and then any other Hadoop
 deployment could start up and take its place on the grid. Let me know if
 this assumption is wrong.

 I was having a lot of grief trying to do a parallel, better-permissioned
 Hadoop install the easy way, so I just went ahead and made copies on each
 node into the /new/dir location, and pointed hdfs.tmp.dir appropriately.

 So in a normal start-all.sh sequence, we have the following processes
 spawned:
 - master has NameNode, 2ndyNameNode, and JobTracker
 - worker has DataNode and TaskTracker

 After I powered down the normal Hadoop installation, I tried to
 start-all.sh mine. Again, everything with this Hadoop should point its home
 to /new/dir/hadoop, unless there's some deep hidden param I didn't know
 about. The processes I got were only
 - master: 2ndyNameNode, JobTracker
 - worker: TaskTracker

 Another hint is the error that calling the hadoop shell gives:
 $ bin/hadoop dfs -ls /
 08/08/04 19:25:32 INFO ipc.Client: Retrying connect to server:
 master/ip:50001. Already tried 1 time(s).
 08/08/04 19:25:33 INFO ipc.Client: Retrying connect to server: master
 /ip:50001. Already tried 2 time(s).
 08/08/04 19:25:34 INFO ipc.Client: Retrying connect to server:
 master/ip:50001. Already tried 3 time(s).

 I can't for the life of me figure out why the others are missing.

 On Mon, Aug 4, 2008 at 4:17 PM, Meng Mao [EMAIL PROTECTED] wrote:

 I see. I think I could also modify the hadoop-env.sh in the new conf/
 folders per datanode to point
 to the right place for HADOOP_HOME.


 On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer [EMAIL PROTECTED]wrote:




 On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote:
  I suppose I could, for each datanode, symlink things to point to the actual
  Hadoop installation. But really, I would like the setup that is hinted as
  possible by statement 1). Is there a way I could do it, or should that bit
  of documentation read, "All machines in the cluster _must_ have the same
  HADOOP_HOME"?

 If you run the -all scripts, they assume the location is the same.
 AFAIK, there is nothing preventing you from building your own -all
 scripts
 that point to the different location to start/stop the data nodes.





 --
 hustlin, hustlin, everyday I'm hustlin




 --
 hustlin, hustlin, everyday I'm hustlin




-- 
hustlin, hustlin, everyday I'm hustlin


Reducer with two sets of inputs

2008-08-05 Thread Theocharis Ian Athanasakis
What's the proposed design pattern for a reducer that needs two
sets of inputs?
Are there any source code examples?

Thanks :)


Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread tim robertson
I am a newbie also, so my answer is not an expert user's by any means.
 That said:

This is not what MapReduce is designed for...

If you have a reporting tool, for example, which takes a database a
very long time to answer - such a long time that you can't expect a
user to hang around waiting for the HTTP response - you might use
Hadoop to churn through the data and produce the report, with a
response to the user like "your data is being processed, please check
back at this_URL soon".

It is not designed as the thing that answers real time synchronous
requests though (e.g. users clicking on links), nor to handle high
traffic load - for that you need more servers, and a load balancer
like you say - and scaling out your DB to have multiple read only
copies.

Consider a search engine - Yahoo is crawling all the web sites and
using MR to process the data to create indexes of the words on pages.
But when you search on Yahoo as a user, it is not an MR job that is
running to provide the answers.  Here you could say MR is playing the
role of generating the index offline, which is then loaded into
something that can answer the query immediately.  You might consider
Lucene or Solr or something for that... (Solr especially, I would say)

You might find http://highscalability.com/ interesting...

Cheers,

Tim


On Tue, Aug 5, 2008 at 8:11 PM, Mork0075 [EMAIL PROTECTED] wrote:
 Hello,

 I just discovered the Hadoop project and it looks really interesting to me.
 As far as I can see at the moment, Hadoop is really useful for data-intensive
 computations. Is there a Hadoop scenario for scaling web applications too?
 Normally web applications are not that computation heavy. The need to
 scale them arises from increasing numbers of users, each of whom performs
 (in his own session) simple operations like querying some data from the database.

 So, distributing this scenario, a Hadoop job would be to map the requests
 to a certain server in the cluster and reduce them. But this is what load
 balancers normally do, so this doesn't solve the scalability problem.

 So my question: is there a Hadoop scenario for non-computation-heavy but
 heavy-load web applications?

 Thanks a lot



Confusing NameNodeFailover page in Hadoop Wiki

2008-08-05 Thread Konstantin Shvachko

I was wandering around the Hadoop wiki and found this page dedicated to name-node
failover.
http://wiki.apache.org/hadoop/NameNodeFailover

I think it is confusing, contradicts other documentation on the subject, and
contains incorrect facts. See
http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Secondary+Namenode
http://wiki.apache.org/hadoop/FAQ#7

Besides, it contains some kind of discussion.
It is not that I am against discussions; let's have them on this list.
But I was trying to understand where all the confusion about secondary name-node
issues has been coming from lately...

IMHO we either need to correct it or remove it.

Thanks,
--Konstantin


Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread Leon Mergen
Hello,

On Tue, Aug 5, 2008 at 8:11 PM, Mork0075 [EMAIL PROTECTED] wrote:

 So my question: is there a Hadoop scenario for non-computation-heavy but
 heavy-load web applications?


I suggest you look into HBase, a subproject of Hadoop:
http://hadoop.apache.org/hbase/ -- it is modeled after Google's Bigtable
and works on top of Hadoop's DFS. It allows quick retrieval of small
portions of data, in a distributed fashion.

Regards,

Leon Mergen


How to write JAVA code for Hadoop streaming.

2008-08-05 Thread Gopal Gandhi
I am using Hadoop streaming and I want to write the map/reduce scripts in Java,
rather than Perl, etc. Would anybody give me a sample? Thanks
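(Editorial sketch, not from the thread: with streaming, the mapper is simply a program that reads input records from stdin one line at a time and writes key<TAB>value pairs to stdout, so a Java mapper is an ordinary class with a main() method. A minimal word-count-style example; how you launch it, e.g. via -file plus a -mapper command such as "java -cp myjar.jar StreamMapper", is an assumption about your setup rather than a prescribed recipe:)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// Sketch of a streaming mapper in Java: read lines from stdin,
// emit "word<TAB>1" pairs on stdout. A streaming reducer would read the
// sorted "key<TAB>value" lines from stdin in the same way.
public class StreamMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                System.out.println(tok.nextToken() + "\t1");
            }
        }
    }
}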



  

Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread Kylie McCormick
Hello:
I am actually working on this myself on my project Multisearch. The Map()
function uses clients to connect to services and collect responses, and the
Reduce() function merges them together. I'm working on putting this into a
Servlet as well, so it can be used via Tomcat.

I've worked with a number of different web services... OGSA-DAI and Axis Web
Services. My experience with Hadoop (which is not entirely researched yet)
is that it is faster than using these other methods alone. Hopefully by the
end of the summer I'll have some more research on this topic (about speed).

The other links posted here are really helpful...

Kylie


On Tue, Aug 5, 2008 at 10:11 AM, Mork0075 [EMAIL PROTECTED] wrote:

 Hello,

 I just discovered the Hadoop project and it looks really interesting to me.
 As far as I can see at the moment, Hadoop is really useful for data-intensive
 computations. Is there a Hadoop scenario for scaling web applications too?
 Normally web applications are not that computation heavy. The need to
 scale them arises from increasing numbers of users, each of whom performs
 (in his own session) simple operations like querying some data from the database.

 So, distributing this scenario, a Hadoop job would be to map the requests
 to a certain server in the cluster and reduce them. But this is what load
 balancers normally do, so this doesn't solve the scalability problem.

 So my question: is there a Hadoop scenario for non-computation-heavy but
 heavy-load web applications?

 Thanks a lot




-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

Light, seeking light, doth the light of light beguile!
-- William Shakespeare's Love's Labor's Lost


DFS. How to read from a specific datanode

2008-08-05 Thread Kevin
Hi,

This is about dfs only, not to consider mapreduce. It may sound like a
strange need, but sometimes I want to read a block from a specific
data node which holds a replica. Figuring out which datanodes have the
block is easy. But is there an easy way to specify which datanode I
want to load from?

Best,
-Kevin


Re: Reducer with two sets of inputs

2008-08-05 Thread Theocharis Ian Athanasakis
Apologies for misphrasing my question.

Let me rephrase it: Using the Hadoop Java APIs is there a suggested
way of doing a pair-wise comparison between all LineRecords in a file?

More generically: is there a Hadoop Java API design pattern for a
reducer to iterate through all the records in another file stored on
HDFS?

I'm currently using the DistributedCache class to cache the reference
file locally. The shard a reducer is examining is always a part of the
reference file. My reducer, then, ends up doing all the comparisons
between its shard and the reference file.

When all of these get combined, I have my pair-wise comparison between
all records.
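(Editorial sketch, not from the thread: the usual wiring for the approach described above, using the old org.apache.hadoop.mapred API. The HDFS path and file names are placeholders.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Sketch: ship a reference file to every reducer with DistributedCache.
public class ReferenceFileCache {

    // At job-submission time: register the HDFS file to be cached on each node.
    public static void register(JobConf job) throws IOException {
        DistributedCache.addCacheFile(URI.create("/data/reference.txt"), job); // placeholder path
    }

    // Inside the reducer's configure(JobConf): open the local copy and
    // iterate over its records for the pair-wise comparison.
    public static BufferedReader openLocalCopy(JobConf job) throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
        return new BufferedReader(new FileReader(localFiles[0].toString()));
    }
}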

Any better ways?

On Tue, Aug 5, 2008 at 11:20, Theocharis Ian Athanasakis
[EMAIL PROTECTED] wrote:
 What's the proposed design pattern for a reducer that needs two
 sets of inputs?
 Are there any source code examples?

 Thanks :)



Re: DFS. How to read from a specific datanode

2008-08-05 Thread lohit
I haven't tried it, but see if you can create a DFSClient object and use its
open() and read() calls to get the job done. Basically you would have to force
currentNode to be your node of interest in there.
Just curious, what is the use case for your request?

Thanks,
Lohit



- Original Message 
From: Kevin [EMAIL PROTECTED]
To: core-user@hadoop.apache.org core-user@hadoop.apache.org
Sent: Tuesday, August 5, 2008 6:59:55 PM
Subject: DFS. How to read from a specific datanode

Hi,

This is about dfs only, not to consider mapreduce. It may sound like a
strange need, but sometimes I want to read a block from a specific
data node which holds a replica. Figuring out which datanodes have the
block is easy. But is there an easy way to specify which datanode I
want to load from?

Best,
-Kevin