RE: EOFException while starting name node
Thanks. It worked.

Amol

-----Original Message-----
From: lohit [mailto:[EMAIL PROTECTED]
Sent: Monday, August 04, 2008 10:20 PM
To: core-user@hadoop.apache.org
Subject: Re: EOFException while starting name node

We have seen similar exceptions reported by others on the list before. What you might want to try is to use a hex editor or equivalent to open up 'edits' and get rid of the last record. In such cases the last record may be incomplete, which is why your namenode is not starting. Once you have updated your edits file, start the namenode and run 'hadoop fsck /' to see if you have any corrupt files, and fix or get rid of them.

PS: Take a backup of dfs.name.dir before updating and playing around with it.

Thanks,
Lohit

----- Original Message -----
From: steph [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Monday, August 4, 2008 8:31:07 AM
Subject: Re: EOFException while starting name node

2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
        at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
        at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
        at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
        at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
        at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

Actually my exception is slightly different from yours. Maybe moving the edits file and recreating a new one will work for you.

On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

I'm getting the following exceptions while starting the name node:

ERROR dfs.NameNode: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
        at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
        at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
        at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
        at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data?

Thanks,
Amol
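As a rough illustration of Lohit's suggestion: once you have found where the last complete record ends, the truncation itself does not need a hex editor. This is only a minimal sketch, not part of the original thread; it assumes you have already backed up dfs.name.dir, are working on a copy of the edits file, and have determined the byte offset of the end of the last complete record by inspecting the file.

[code]
import java.io.RandomAccessFile;

/** Truncates a copy of the 'edits' log at a given byte offset. */
public class TruncateEdits {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a COPY of the edits file (never the live one)
        // args[1]: byte offset where the last complete record ends,
        //          determined beforehand with a hex editor or od/xxd
        RandomAccessFile edits = new RandomAccessFile(args[0], "rw");
        edits.setLength(Long.parseLong(args[1]));
        edits.close();
    }
}
[/code]

After swapping the truncated copy back in, start the namenode and run 'hadoop fsck /' as suggested above.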
Re: MultiFileInputFormat and gzipped files
MultiFileWordCount uses its own RecordReader, namely MultiFileLineRecordReader. This is different from LineRecordReader, which automatically detects the file's codec and decodes it. You can write a custom RecordReader similar to LineRecordReader and MultiFileLineRecordReader, or just add codec support to MultiFileLineRecordReader.

Michele Catasta wrote:

Hi all, I'm writing some Hadoop jobs that should run on a collection of gzipped files. Everything is already working correctly with MultiFileInputFormat and an initial step of gunzip extraction. Considering that Hadoop recognizes and handles .gz files correctly (at least with a single-file input), I was wondering whether it can do the same with file collections, so that I avoid the overhead of sequential file extraction. I tried to run the multi-file WordCount example with a bunch of gzipped text files (0.17.1 installation), and I get wrong output (neither correct nor empty). With my own InputFormat (not really different from the one in multifilewc), I got no output at all (map input record counter = 0). Is this the desired behavior? Are there technical reasons why it doesn't work in a multi-file scenario? Thanks in advance for the help. Regards, Michele Catasta
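For the custom RecordReader route, the key step is wrapping the raw file stream with the codec that matches the file's extension. A minimal sketch of just that step, using the standard compression APIs (the surrounding split and position bookkeeping would follow MultiFileLineRecordReader; the class name here is made up):

[code]
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecAwareOpen {
    /** Opens a file, decompressing it if its extension maps to a known codec (e.g. .gz). */
    public static InputStream open(Configuration conf, Path file) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        InputStream raw = fs.open(file);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        // No codec found: return the raw stream; otherwise wrap it for on-the-fly decompression.
        return codec == null ? raw : codec.createInputStream(raw);
    }
}
[/code]

A RecordReader built on such a stream would read lines from it instead of from the raw FSDataInputStream. Note that gzipped files cannot be split, so each .gz file has to be consumed by a single reader.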
libhdfs and multithreaded applications
Hello,

At the libhdfs wiki (http://wiki.apache.org/hadoop/LibHDFS#Threading) I read this: "libhdfs can be used in threaded applications using Posix Threads. However, to interact carefully with JNI's global/local references the user has to explicitly call the hdfsConvertToGlobalRef / hdfsDeleteGlobalRef APIs."

I cannot find any reference to these functions anywhere in the hadoop-0.17.1 source base. Are these functions deprecated, with multi-threaded applications now supported out of the box, or has something else changed?

Regards,
Leon Mergen
log4j problems in hadoop-0.17.1
Dear All,

I see that conf/log4j.properties specifies three appenders: ConsoleAppender, DailyRollingFileAppender and TaskLogAppender. From the output locations I know that the JobClient's output target is the ConsoleAppender, that the JobTracker, TaskTracker, NameNode and DataNode all log to the DailyRollingFileAppender, and that the TaskLogAppender is the target of every task's output, whether map task or reduce task.

But the problem is that the configuration file only sets log4j.rootLogger = INFO,console. The other two appenders have no corresponding logger, so how is the relationship between, for example, the DailyRollingFileAppender and the JobTracker established? Where can I find it: in source code, a script file, or a configuration file? I am asking because I want to add some logging to my own program with log4j.

Thanks very much!
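On the last point (adding your own log statements), the usual approach is to log through Commons Logging, the same facade the Hadoop classes use on top of log4j; as noted above, messages emitted inside a task then end up in that task's log via the TaskLogAppender. A minimal sketch against the 0.17-era org.apache.hadoop.mapred API (the class name and message text are made up for illustration):

[code]
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // This message goes to the per-task log on the tasktracker,
    // routed there by the TaskLogAppender.
    LOG.info("processing record at byte offset " + key.get());
    output.collect(value, new LongWritable(1));
  }
}
[/code]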
Linux server clustered HDFS: access from Windows eclipse Java application
Hi all. I'm running a clustered HDFS on Linux and I need to access files (I/O) from an Eclipse Java application running on Windows. It seems simple, but is it possible? I have written code using the API, but I have a problem: when the code invokes the DistributedFileSystem.initialize() method I receive a java.net.SocketTimeoutException.

[code]
String ipStr = "192.168.75.191";
String portStr = "9000";
String uriStr = "http://" + ipStr + ":" + portStr;
Configuration conf = new Configuration();
conf.set("hadoop.job.ugi", "user,group"); // user and the groups it belongs to
DistributedFileSystem dfs = new DistributedFileSystem();
dfs.initialize(new URI(uriStr), conf);
[/code]

[trace]
Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:559)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
        at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
        at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178)
        at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
        at examples.HadoopDFS.main(HadoopDFS.java:153)
[/trace]
Re: Linux server clustered HDFS: access from Windows eclipse Java application
I think IBM has a plugin that can access HDFS. I don't know whether it contains source code, but maybe it helps: www.alphaworks.ibm.com/tech/mapreducetools

On Tue, Aug 5, 2008 at 5:16 AM, Alberto Forcén [EMAIL PROTECTED] wrote: Hi all. I'm running a clustered HDFS on Linux and I need to access files (I/O) from an Eclipse Java application running on Windows. It seems simple, but is it possible?
Re: having different HADOOP_HOME for master and slaves?
Is there any way for me to log and find out why the NameNode process is not launching on the master?

On Mon, Aug 4, 2008 at 8:19 PM, Meng Mao [EMAIL PROTECTED] wrote:

Assumption -- if I run stop-all.sh _successfully_ on a Hadoop deployment (which means every node in the grid is using the same path to Hadoop), then that Hadoop installation becomes invisible, and any other Hadoop deployment could start up and take its place on the grid. Let me know if this assumption is wrong.

I was having a lot of grief trying to do a parallel, better-permissioned Hadoop install the easy way, so I just went ahead and made copies on each node into the /new/dir location, and pointed hdfs.tmp.dir appropriately.

So in a normal start-all.sh sequence, we have the following processes spawned:
- master has NameNode, 2ndyNameNode, and JobTracker
- worker has DataNode and TaskTracker

After I powered down the normal Hadoop installation, I tried to start-all.sh mine. Again, everything in this Hadoop should point its home to /new/dir/hadoop, unless there's some deep hidden param I didn't know about. The processes I got were only:
- master: 2ndyNameNode, JobTracker
- worker: TaskTracker

Another hint is the error that calling the hadoop shell gives:

$ bin/hadoop dfs -ls /
08/08/04 19:25:32 INFO ipc.Client: Retrying connect to server: master/ip:50001. Already tried 1 time(s).
08/08/04 19:25:33 INFO ipc.Client: Retrying connect to server: master/ip:50001. Already tried 2 time(s).
08/08/04 19:25:34 INFO ipc.Client: Retrying connect to server: master/ip:50001. Already tried 3 time(s).

I can't for the life of me reason out why the others are missing.

On Mon, Aug 4, 2008 at 4:17 PM, Meng Mao [EMAIL PROTECTED] wrote:

I see. I think I could also modify the hadoop-env.sh in the new conf/ folders per datanode to point to the right place for HADOOP_HOME.

On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer [EMAIL PROTECTED] wrote:

On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote: I suppose I could, for each datanode, symlink things to point to the actual Hadoop installation. But really, I would like the setup that is hinted as possible by statement 1). Is there a way I could do it, or should that bit of documentation read, "All machines in the cluster _must_ have the same HADOOP_HOME"?

If you run the -all scripts, they assume the location is the same. AFAIK, there is nothing preventing you from building your own -all scripts that point to the different location to start/stop the data nodes.

--
hustlin, hustlin, everyday I'm hustlin
Reducer with two sets of inputs
What's the proposed design pattern for a reducer that needs two sets of inputs? Are there any source code examples? Thanks :)
Re: Hadoop also applicable in a web app environment?
I am a newbie also, so my answer is not an expert user's by any means. That said: this is not what MR is designed for.

If you have a reporting tool, for example, which takes a database a very long time to answer - such a long time that you can't expect a user to hang around waiting for the HTTP response - you might use Hadoop to churn through the data and produce the report, with a response to the user like "your data is being processed, please check back this_URL soon". It is not designed as the thing that answers real-time synchronous requests, though (e.g. users clicking on links), nor to handle high traffic load - for that you need more servers and a load balancer, like you say, plus scaling out your DB to have multiple read-only copies.

Consider a search engine - Yahoo are crawling all the web sites and using MR to process the data to create indexes of the words on pages. But when you search on Yahoo as a user, it is not a MR job that is running to provide the answers. Here you could say MR is playing the role of generating the index offline, which is then loaded into something that can answer the query immediately. You might consider Lucene or SOLR or something for that (SOLR especially, I would say).

You might find http://highscalability.com/ interesting.

Cheers,
Tim

On Tue, Aug 5, 2008 at 8:11 PM, Mork0075 [EMAIL PROTECTED] wrote:

Hello, I just discovered the Hadoop project and it looks really interesting to me. As far as I can see at the moment, Hadoop is really useful for data-intensive computations. Is there a Hadoop scenario for scaling web applications too? Normally web applications are not that computation-heavy. The need to scale them arises from increasing numbers of users, who perform (each user in his session) simple operations like querying some data from the database. So, distributing this scenario, a Hadoop job would be to map the requests to a certain server in the cluster and reduce the results. But this is what load balancers normally do; it doesn't solve the scalability problem so far. So my question: is there a Hadoop scenario for non-computation-heavy but heavy-load web applications?

Thanks a lot
Confusing NameNodeFailover page in Hadoop Wiki
I was wandering around the Hadoop wiki and found this page dedicated to name-node failover:
http://wiki.apache.org/hadoop/NameNodeFailover

I think it is confusing, contradicts other documentation on the subject, and contains incorrect facts. See:
http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Secondary+Namenode
http://wiki.apache.org/hadoop/FAQ#7

Besides that, it contains some kind of discussion. It is not that I am against discussions - let's have them on this list. But I was trying to understand where all the confusion about secondary-node issues has been coming from lately... IMHO we either need to correct the page or remove it.

Thanks,
--Konstantin
Re: Hadoop also applicable in a web app environment?
Hello,

On Tue, Aug 5, 2008 at 8:11 PM, Mork0075 [EMAIL PROTECTED] wrote: So my question: is there a Hadoop scenario for non-computation-heavy but heavy-load web applications?

I suggest you look into HBase, a subproject of Hadoop: http://hadoop.apache.org/hbase/ -- it is designed after Google's Bigtable and works on top of Hadoop's DFS. It allows quick retrieval of small portions of data, in a distributed fashion.

Regards,
Leon Mergen
How to write JAVA code for Hadoop streaming.
I am using Hadoop streaming and I want to write the map/reduce scripts in Java, rather than Perl, etc. Would anybody give me a sample? Thanks
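Streaming runs whatever command you pass as -mapper/-reducer and talks to it over stdin/stdout, one tab-separated key/value pair per line, so a Java "script" is just a regular main() that reads stdin and writes stdout. A minimal sketch (the class name and word-count logic are made up for illustration; you would make the class available on each node, e.g. via the streaming -file option, and invoke it with something like -mapper 'java -cp myjob.jar StreamingWordCountMapper'):

[code]
import java.io.BufferedReader;
import java.io.InputStreamReader;

/** Streaming mapper: reads input lines from stdin, emits "word<TAB>1" pairs on stdout. */
public class StreamingWordCountMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.split("\\s+")) {
        if (word.length() > 0) {
          // Streaming treats everything before the first tab as the key.
          System.out.println(word + "\t1");
        }
      }
    }
  }
}
[/code]

A reducer is written the same way: it reads the sorted key/value lines from stdin, accumulates values while the key stays the same, and prints one output line per key.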
Re: Hadoop also applicable in a web app environment?
Hello:

I am actually working on this myself in my project Multisearch. The Map() function uses clients to connect to services and collect responses, and the Reduce() function merges them together. I'm working on putting this into a Servlet as well, so it can be used via Tomcat. I've worked with a number of different web services - OGSA-DAI and Axis Web Services. My experience with Hadoop (which is not entirely researched yet) is that it is faster than using these other methods alone. Hopefully by the end of the summer I'll have some more research on this topic (about speed). The other links posted here are really helpful.

Kylie

On Tue, Aug 5, 2008 at 10:11 AM, Mork0075 [EMAIL PROTECTED] wrote: So my question: is there a Hadoop scenario for non-computation-heavy but heavy-load web applications?

--
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

Light, seeking light, doth the light of light beguile! -- William Shakespeare's Love's Labor's Lost
DFS. How to read from a specific datanode
Hi, This is about DFS only; MapReduce is not a consideration here. It may sound like a strange need, but sometimes I want to read a block from a specific datanode which holds a replica. Figuring out which datanodes have the block is easy. But is there an easy way to specify which datanode I want to load from? Best, -Kevin
Re: Reducer with two sets of inputs
Apologies for misphrasing my question. Let me rephrase it: using the Hadoop Java APIs, is there a suggested way of doing a pair-wise comparison between all LineRecords in a file? More generically: is there a Hadoop Java API design pattern for a reducer to iterate through all the records in another file stored on HDFS?

I'm currently using the DistributedCache class to cache the reference file locally. The shard a reducer is examining is always a part of the reference file. My reducer, then, ends up doing all the comparisons between its shard and the reference file. When all of these get combined, I have my pair-wise comparison between all records. Any better ways?

On Tue, Aug 5, 2008 at 11:20, Theocharis Ian Athanasakis [EMAIL PROTECTED] wrote: What's the proposed design pattern for a reducer that needs two sets of inputs? Are there any source code examples? Thanks :)
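For reference, the DistributedCache approach described above looks roughly like the following sketch (old org.apache.hadoop.mapred API; the class name, cache path and the pairwise comparison itself are placeholders): the driver registers the reference file in the cache, and the reducer loads its local copy once in configure(), then compares every incoming record against it.

[code]
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PairwiseReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Job setup (driver side, path is a placeholder):
  //   DistributedCache.addCacheFile(new java.net.URI("/data/reference.txt"), jobConf);

  private final List<String> reference = new ArrayList<String>();

  public void configure(JobConf job) {
    try {
      // The cached file has already been copied to the local disk of this task's node.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        reference.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not read cached reference file", e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      String record = values.next().toString();
      for (String ref : reference) {
        // Placeholder for the real pairwise comparison; here we just emit the pair.
        output.collect(key, new Text(record + "\t" + ref));
      }
    }
  }
}
[/code]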
Re: DFS. How to read from a specific datanode
I haven't tried it, but see if you can create a DFSClient object and use its open() and read() calls to get the job done. Basically you would have to force currentNode to be your node of interest in there. Just curious: what is the use case for your request?

Thanks,
Lohit

----- Original Message -----
From: Kevin [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, August 5, 2008 6:59:55 PM
Subject: DFS. How to read from a specific datanode

It may sound like a strange need, but sometimes I want to read a block from a specific datanode which holds a replica. Figuring out which datanodes have the block is easy. But is there an easy way to specify which datanode I want to load from?
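As a side note on the "figuring out which datanodes have the block" half, something like the following lists the replica locations per block through the public FileSystem API. This is only a sketch: it uses getFileBlockLocations, which in older releases such as 0.17 existed, if I recall correctly, as the deprecated getFileCacheHints; the file path is a placeholder. Steering the actual read to one of those hosts would still mean going through DFSClient as suggested above.

[code]
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListReplicaHosts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]); // e.g. a path like /user/kevin/somefile (placeholder)
    FileStatus stat = fs.getFileStatus(file);

    // One BlockLocation per block; getHosts() names the datanodes holding a replica.
    for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
      System.out.println("block at offset " + loc.getOffset()
          + " -> " + Arrays.toString(loc.getHosts()));
    }
  }
}
[/code]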