Re: 1st Hadoop India User Group meet

2009-11-10 Thread Amandeep Khurana
Sanjay,

Congratulations on holding the first meetup. All the best with it.

It's exciting to see work being done in India involving Hadoop. I've been a
part of some projects in the Hadoop ecosystem and have done some research
work during my graduate studies as well as for a project at Cisco Systems.

I'm traveling to Delhi in December and would love to meet and talk about how
and what you and other users are doing in this area. Would you be
interested?

Looking forward to hearing from you.

Regards
Amandeep



On Mon, Nov 9, 2009 at 10:19 PM, Sanjay Sharma
sanjay.sha...@impetus.co.in wrote:

 We are planning to hold the first Hadoop India user group meetup on 28th
 November 2009 in Noida.

 We would be talking about our experiences with Apache
 Hadoop/Hbase/Hive/PIG/Nutch/etc.

 The agenda would be:
 - Introductions
 - Sharing experiences on Hadoop and related technologies
 - Establishing agenda for the next few meetings
 - Information exchange: tips, tricks, problems and open discussion
 - Possible speaker TBD (invitations open!!)  {we do have something to share
 on Hadoop for newbies & Hadoop Advanced Tuning}

 My company (Impetus) would be providing the meeting room and we should be
 able to accommodate around 40-60 friendly people. Coffee, Tea, and some
 snacks will be provided.

 Please join the linked-in Hadoop India User Group (
 http://www.linkedin.com/groups?home=gid=2258445trk=anet_ug_hm) OR Yahoo
 group (http://tech.groups.yahoo.com/group/hadoopind/) and confirm your
 attendance.

 Regards,
 Sanjay Sharma

 Follow our updates on www.twitter.com/impetuscalling.

 * Impetus Technologies is exhibiting its capabilities in Mobile and Wireless
 in the GSMA Mobile Asia Congress, Hong Kong from November 16-18, 2009. Visit
 http://www.impetus.com/mlabs/GSMA_events.html for details.

 NOTE: This message may contain information that is confidential,
 proprietary, privileged or otherwise protected by law. The message is
 intended solely for the named addressee. If received in error, please
 destroy and notify the sender. Any use of this email is prohibited when
 received in error. Impetus does not represent, warrant and/or guarantee,
 that the integrity of this communication has been maintained nor that the
 communication is free of errors, virus, interception or interference.



Re: Where is the eclipse plug-in for hadoop 0.20.1

2009-11-10 Thread Jeff Zhang
Hi Stephen,

Thank you. It works


Jeff Zhang



On Mon, Nov 9, 2009 at 10:31 PM, Stephen Watt sw...@us.ibm.com wrote:

 Hi Jeff

 That is correct. The plugin for 0.20.1 exists only in the src/contrib as
 it has some build and runtime issues. It is presently being tracked here -
 http://issues.apache.org/jira/browse/HADOOP-6360

 In the interim, if you go to that JIRA, you can obtain a 0.20.1 plugin.jar
 that I have attached to the JIRA as a stop gap measure. I'd appreciate it
 if you could report in the JIRA what works for you and what does not with
 the attached plugin. Also, if you have any additional features for the
 plugin that you would like to request, feel free to add them as a comment
 to the JIRA.

 Regards
 Steve Watt



 From:
 Jeff Zhang zjf...@gmail.com
 To:
 core-u...@hadoop.apache.org
 Date:
 11/09/2009 12:09 AM
 Subject:
 Where is the eclipse plug-in for hadoop 0.20.1



 Hi all,

 I could not find the eclipse plug-in for hadoop 0.20.1. I only found the
 eclipse plugin source code, but I do not know how to build the plug-in.

 Could anyone give some help?


 Thank you.

 Jeff Zhang





[Ask for help]: IOException: Expecting a line not the end of stream, hadoop-0.20.1 in Damn Small Linux

2009-11-10 Thread Neo Tan

Dear all,

 

I am new to learning hadoop and encountered a problem while following the
Hadoop Quick Start tutorial
(http://hadoop.apache.org/common/docs/current/quickstart.html).
Everything in Cygwin is okay, but not in Damn Small Linux (DSL).

 

In Damn Small Linux, after executing the command:

---bin/hadoop jar hadoop-0.20.1-examples.jar grep input output 'dfs[a-z.]+'

it output the errors shown below as OUTPUT_01. Based on the errors, I tried
df -k; its output is shown as OUTPUT_02, and it wasn't NULL/empty.

 

I tried searching all of the common-user & core-user mailing lists and Google,
but got no solution, so I have to send this email to modestly ask for your help.
Please kindly reply if anyone has an idea. Thanks in advance! =)

 

Best regards.
Neo Tan
 
OUTPUT_01:==

r...@box:/home/hadoop/hadoop-0.20.1# bin/hadoop jar hadoop-0.20.1-examples.jar 
grep input output 'dfs[a-z.]+'  
09/11/10 17:12:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
09/11/10 17:12:43 INFO mapred.FileInputFormat: Total input paths to process : 5
09/11/10 17:12:44 INFO mapred.FileInputFormat: Total input paths to process : 5
09/11/10 17:12:44 INFO mapred.JobClient: Running job: job_local_0001
09/11/10 17:12:44 INFO mapred.MapTask: numReduceTasks: 1
09/11/10 17:12:44 INFO mapred.MapTask: io.sort.mb = 100
09/11/10 17:12:45 INFO mapred.MapTask: data buffer = 79691776/99614720
09/11/10 17:12:45 INFO mapred.MapTask: record buffer = 262144/327680
09/11/10 17:12:45 INFO mapred.MapTask: Starting flush of map output
09/11/10 17:12:45 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: Expecting a line not the end of stream
at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1431)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
09/11/10 17:12:45 WARN util.Shell: Error reading the error stream
java.io.IOException: Stream closed
at java.io.BufferedReader.ensureOpen(BufferedReader.java:97)
at java.io.BufferedReader.readLine(BufferedReader.java:292)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at org.apache.hadoop.util.Shell$1.run(Shell.java:164)
09/11/10 17:12:45 INFO mapred.JobClient:  map 0% reduce 0%
09/11/10 17:12:45 INFO mapred.JobClient: Job complete: job_local_0001
09/11/10 17:12:45 INFO mapred.JobClient: Counters: 0
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.hadoop.examples.Grep.run(Grep.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.Grep.main(Grep.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
r...@box:/home/hadoop/hadoop-0.20.1# 
=


OUTPUT_02:=

r...@box:/home/hadoop/hadoop-0.20.1# df -k
Filesystem   1k-blocks  Used Available Use% Mounted on
/dev/hda1  8254240   1027828   6807120  13% /
=
  

Lucene + Hadoop

2009-11-10 Thread Hrishikesh Agashe
Hi,

I am trying to use Hadoop for Lucene index creation. I have to create multiple
indexes based on the contents of the files (i.e. if the author is hrishikesh,
the file should be added to an index for hrishikesh; there has to be a separate
index for every author). For this, I am keeping an IndexWriter open for every
author and maintaining them in a hashmap in the map() function. I parse each
incoming file, and if I see an author for which I have already opened an
IndexWriter, I just add the file to that index; otherwise I create a new
IndexWriter for the new author. As the authors might run into the thousands, I
am closing the IndexWriters and clearing the hashmap once it reaches a certain
threshold and starting all over again. There is no reduce function.

Does this logic sound correct? Is there any other way of implementing this 
requirement?
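
A minimal sketch of what such a mapper might look like, assuming the old mapred
API and the Lucene 2.9 IndexWriter constructor; the index root property, the
writer threshold, and parseAuthor() are hypothetical placeholders, and the
mapper emits nothing since there is no reduce step:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerAuthorIndexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final int MAX_OPEN_WRITERS = 100;  // hypothetical threshold
  private final Map<String, IndexWriter> writers = new HashMap<String, IndexWriter>();
  private String indexRoot;

  public void configure(JobConf job) {
    // Hypothetical property naming a directory that holds one index per author.
    indexRoot = job.get("perauthor.index.root", "/tmp/indexes");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    String author = parseAuthor(value.toString());
    IndexWriter writer = writers.get(author);
    if (writer == null) {
      if (writers.size() >= MAX_OPEN_WRITERS) {
        closeAll();  // flush everything and start over, as described above
      }
      writer = new IndexWriter(FSDirectory.open(new File(indexRoot, author)),
          new StandardAnalyzer(Version.LUCENE_29), true,
          IndexWriter.MaxFieldLength.UNLIMITED);
      writers.put(author, writer);
    }
    Document doc = new Document();
    doc.add(new Field("body", value.toString(), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }

  public void close() throws IOException {
    closeAll();  // make sure the last batch of writers is flushed
  }

  private void closeAll() throws IOException {
    for (IndexWriter w : writers.values()) {
      w.close();
    }
    writers.clear();
  }

  private String parseAuthor(String record) {
    // Hypothetical: assume the author is the first tab-separated field.
    return record.split("\t", 2)[0];
  }
}

Since there is no reduce function, the job would also call
conf.setNumReduceTasks(0) so that no (empty) map output is shuffled.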

--Hrishi

DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Automate EC2 cluster termination

2009-11-10 Thread John Clarke
Hi,

I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
but I want to automate it a bit more.

I want to be able to:
- start cluster
- copy data from S3 to the DFS
- run the job
- copy result data from DFS to S3
- verify it all copied ok
- shutdown the cluster.


I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that provide a time based shutdown but they are not suitable
as our jobs vary in time.

Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.

Thanks,
John


Re: Automate EC2 cluster termination

2009-11-10 Thread Edmund Kohlwey
You should be able to detect the status of the job in your java main()
method: either call job.waitForCompletion() and, when the job finishes
running, use job.isSuccessful(); or, if you want to, you can write a custom
watcher thread to poll job status manually, which will allow you to, for
instance, launch several jobs and wait for them to return. You will poll the
job tracker with either method, but I think the overhead is pretty minimal.
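
Against the old 0.18.x mapred API that the Cloudera AMI ships, a minimal
sketch of that polling approach could look like the following; the job
configuration itself is assumed to be filled in where the comment indicates:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class RunAndReport {
  public static void main(String[] args) throws Exception {
    // Hypothetical job configuration; mapper, reducer, input and output
    // paths would be set on this JobConf as usual.
    JobConf conf = new JobConf(RunAndReport.class);

    JobClient jc = new JobClient(conf);
    RunningJob running = jc.submitJob(conf);

    // Poll the JobTracker until the job finishes.
    while (!running.isComplete()) {
      Thread.sleep(10 * 1000);
    }

    // The exit status tells a wrapping shell script whether it is safe to
    // copy the results back to S3 and terminate the cluster.
    System.exit(running.isSuccessful() ? 0 : 1);
  }
}

A wrapper script can then branch on the exit code, running the copy back to
S3 and the cluster shutdown only on success.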


I'm not sure if it's necessary to copy data from S3 to DFS, btw (unless 
you have a performance reason to do so... even then, since you're not 
really guaranteed very much locality on EC2 you probably won't see a 
huge difference). You should probably just set the default file system 
to s3. See http://wiki.apache.org/hadoop/AmazonS3 .
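
A small sketch of what pointing the default filesystem at S3 might look like;
the bucket name and credentials are placeholders, and in practice these
properties usually live in hadoop-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3DefaultFs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical bucket and credentials for the s3n (native) filesystem.
    conf.set("fs.default.name", "s3n://my-bucket");
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // With the default filesystem on S3, job input/output paths like
    // /input and /output resolve inside the bucket, so no copy to HDFS
    // is needed before or after the job.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
  }
}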



On 11/10/09 9:13 AM, John Clarke wrote:

Hi,

I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
but I want to automate it a bit more.

I want to be able to:
- start cluster
- copy data from S3 to the DFS
- run the job
- copy result data from DFS to S3
- verify it all copied ok
- shutdown the cluster.


I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that provide a time based shutdown but they are not suitable
as our jobs vary in time.

Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.

Thanks,
John

   




Hadoop NameNode not starting up

2009-11-10 Thread Kaushal Amin
I am running Hadoop on single server. The issue I am running into is that
start-all.sh script is not starting up NameNode.

Only way I can start NameNode is by formatting it and I end up losing data
in HDFS.

 

Does anyone have solution to this issue?

 

Kaushal

 



Next Boston Hadoop Meetup, Tuesday, November 24th

2009-11-10 Thread Dan Milstein
After a packed, energetic first Boston Hadoop Meetup, we're having  
another.  Next one will be in two weeks, on Tuesday, November 24th, 7  
pm, at the HubSpot offices:


http://www.meetup.com/bostonhadoop/calendar/11834241/

(HubSpot is at 1 Broadway, Cambridge on the fifth floor.  There Will  
Be Food.  There Will Be Beer.)


As before, we'll aim to have two c. 20-minute presentations, with plenty
of time for Q&A after each, and then a few 5-minute lightning talks.
Also, the eating and the chatting.


Please feel free to contact me if you've got an idea for a talk of any  
length, on Hadoop, Hive, Pig, Hbase, etc.


-Dan Milstein
617-401-2855
dmilst...@hubspot.com
http://dev.hubspot.com/



Re: Cross Join

2009-11-10 Thread Edmund Kohlwey
Thanks to all who commented on this. I think there was some confusion 
over what I was trying to do: indeed there was no common key between the 
two tables to join on, which made all the methods I investigated either 
inappropriate or inefficient. In the end I decided to write my own join 
class. It can be written in a reducer or a mapper. While the reducer 
implementation is a bit cleaner, the mapper implementation provides 
(theoretically) better distributed processing. For those who are 
interested, the basic algorithm is:


x is defined as the cross product of two vectors

proc crossproduct:
  Allow mapreduce to partition the left side of the input
  on each mapper:
    let left_i = save all the left side key/value pairs that are processed
      on that node
    in cleanup (or at the end of the reduce): let right = open the right
      side of the join on each node through hdfs

    for each pair of pairs in left_i x right:
      if transform(pair) != null emit transform(pair)
      else continue
    endfor
  end on each
end proc
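
A rough sketch of the mapper-side variant against the 0.20 API; the
crossjoin.right.path property and the transform() predicate are hypothetical,
with transform() here treating the left side as a regex to match the original
problem:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrossJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> leftSide = new ArrayList<String>();

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Buffer the left-side records that mapreduce assigned to this mapper.
    leftSide.add(value.toString());
  }

  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Stream the right side once per mapper, straight from HDFS, and pair it
    // with every buffered left-side record.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path rightPath = new Path(context.getConfiguration().get("crossjoin.right.path"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(rightPath)));
    try {
      String right;
      while ((right = reader.readLine()) != null) {
        for (String left : leftSide) {
          String joined = transform(left, right);
          if (joined != null) {
            context.write(new Text(left), new Text(joined));
          }
        }
      }
    } finally {
      reader.close();
    }
  }

  private String transform(String left, String right) {
    // Hypothetical predicate: emit the right-side string only when it
    // matches the left-side regex; return null to skip the pair.
    return right.matches(left) ? right : null;
  }
}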

The important

On 11/5/09 1:15 PM, Ashutosh Chauhan wrote:

Hi Edmund,

If you can prepare your dataset in the way org.apache.hadoop.mapred.join
requires, then it might be an efficient way to do joins in your case, though
IMHO the requirements it places are pretty restrictive. Also, instead of
reinventing the wheel, I would also suggest you take a look at how Pig tries
to solve the problem of joining large datasets. It has in fact four different
join algorithms implemented, and one or more of them should satisfy your
requirements. It seems to me the merge-join of Pig is well suited to your
case. Its only requirement is that the dataset be sorted on both sides.
Datasets need not be equipartitioned, need not have the same number of
partitions, etc. You said that sorting the dataset is a pain in your case.
Pig's orderby is quite sophisticated and performs sorting rather efficiently.
If doing a sort is indeed not an option, then you may want to consider the
hash join or skewed join of Pig.

Joins in Pig are explained at high-level here:
http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/

Hope it helps,
Ashutosh

On Thu, Nov 5, 2009 at 06:19, Jason Venner jason.had...@gmail.com wrote:

Look at the join package in map reduce, it provides this functionality quite
cleanly, for ordered datasets that have the same partitioning.
org.apache.hadoop.mapred.join in hadoop 19

On Wed, Nov 4, 2009 at 6:52 AM, Edmund Kohlwey ekohl...@gmail.com wrote:

Hi,
I'm looking for an efficient way to do a cross join. I've gone through a
few implementations, and I wanted to seek some advice before attempting
another. The join is a large collection to large collection - so there's
no trick optimizations like downloading one side of the join on each node
(ie. map side join). The output of the join will be sparse, (its basically
matching a large collection of regexes to a large collection of strings),
but because of the nature of the data there's not really any way to
pre-process either side of the join.

1. Naive approach - on a single node, iterate over both collections,
resulting in reading the left file 1 times and the right file n times - I
know this is bad.
2. Indexed approach - index data item with a row/col - requires
replicating, sorting, and shuffling all the records 2 times - also not
good. This actually seemed to perform worse than 1, and resulted in
running out of disk space on the mappers when output was spilled to disk.

I'm now considering what to try next. One idea is to improve on 1 by
blocking the reads, so that the right side of the join is read b times,
where b is the number of blocks the left side is split into.

The other (imho, best) idea is to write a reduce-side join, which would
actually be fully parallelized, which basically relies on map/reduce to
split the left side into blocks, and then allows each reducer to stream
through the right side once. In this version, the right side is still
downloaded b times, but the operation is done in parallel. The only issue
with this is that I would need to iterate over the reduce iterators
multiple times, which is something that M/R doesn't allow (I think). I
know I could save the contents of the iterator locally, but this seems
like a bad design choice too. Does anybody know if there's a smart way to
iterate twice in a reducer?

There's probably some methods I haven't really thought of. Does anyone
have any suggestions?


--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals




Re: Hadoop NameNode not starting up

2009-11-10 Thread Edmund Kohlwey

Is there error output from start-all.sh?

On 11/9/09 11:10 PM, Kaushal Amin wrote:

I am running Hadoop on single server. The issue I am running into is that
start-all.sh script is not starting up NameNode.

Only way I can start NameNode is by formatting it and I end up losing data
in HDFS.



Does anyone have solution to this issue?



Kaushal




   




Error with replication and namespaceID

2009-11-10 Thread Raymond Jennings III
On the actual datanodes I see the following exception:  I am not sure what the 
namespaceID is or how to sync them.  Thanks for any advice!



/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pingo-3.poly.edu/128.238.55.33
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 
810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
/
2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: 
namenode namespaceID = 1016244663; datanode namespaceID = 1687029285
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)


--- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote:

 From: Boris Shkolnik bo...@yahoo-inc.com
 Subject: Re: newbie question - error with replication
 To: common-user@hadoop.apache.org
 Date: Monday, November 9, 2009, 5:02 PM
 Make sure you have at least one
 datanode running.
 Look at the data node log file. (logs/*-datanode-*.log)
 
 Boris.
 
 
 On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com
 wrote:
 
  I am trying to resolve an IOException error.  I
 have a basic setup and shortly
  after running start-dfs.sh I get a:
  
  error: java.io.IOException: File
  /tmp/hadoop-root/mapred/system/jobtracker.info could
 only be replicated to 0
  nodes, instead of 1
  java.io.IOException: File
 /tmp/hadoop-root/mapred/system/jobtracker.info could
  only be replicated to 0 nodes, instead of 1
  
  Any pointers how to resolve this?  Thanks!
  
  
  
        
 
 





java.io.IOException: Could not obtain block:

2009-11-10 Thread John Martyniak

Hello everyone,

I am getting the error "java.io.IOException: Could not obtain block:"
when running on my new cluster.  When I ran the same job on the single
node it worked perfectly; I then added in the second node, and received
this error.  I was running the grep sample job.


I am running Hadoop 0.19.2, because of a dependency on Nutch  
(even though this was not a Nutch job).  I am not running HBase; the  
version of Java is OpenJDK 1.6.0.


Does anybody have any ideas?

Thanks in advance,

-John



Hadoop User Group (Bay Area) - next Wednesday (Nov 18th) at Yahoo!

2009-11-10 Thread Dekel Tankel
Hi all,

We are one week away from the next Bay Area Hadoop User Group  - Yahoo! 
Sunnyvale Campus, next Wednesday (Nov 18th) at 6PM

We have an exciting evening planned:

*Katta, Solr, Lucene and Hadoop - Searching at scale, Jason Rutherglen and 
Jason Venner

*Walking through the New File system API, Sanjay Radia, Yahoo!

*Keep your data in Jute but still use it in python, Paul Tarjan, Yahoo!


Please RSVP here:
http://www.meetup.com/hadoop/calendar/11724002/


Please note that this is the last HUG for 2009, as we will not have a meeting 
in December (due to the holidays).
We will open 2010 with a HUG on Jan 20th.

Looking forward to seeing you next week!

Dekel



Re: Re: how to read file in hadoop

2009-11-10 Thread Gang Luo
It is because the content I read from the file is encoded in UTF-8. I
used Text.decode to decode it back to a plain text string, and the problem
is gone now.
-Gang


- Original Message 
From: Gang Luo lgpub...@yahoo.com.cn
To: common-user@hadoop.apache.org
Sent: 2009/11/10 (Tue) 12:14:44 AM
Subject: Re: Re: how to read file in hadoop

I downloaded it to my local filesystem. The content is correct; I can see it
either by command or by text editor. So I think the file itself has no problem.

--Gang



- Original Message 
From: Jeff Zhang zjf...@gmail.com
To: common-user@hadoop.apache.org
Sent: 2009/11/9 (Mon) 11:58:22 PM
Subject: Re: Re: how to read file in hadoop

Maybe you can download the file to local to see what content is there.


Jeff Zhang


2009/11/10 Gang Luo lgpub...@yahoo.com.cn

 Since there has been no response to this question up to now, I'd like to
 describe more details about it.

 I try to read a file in HDFS and copy it to another file. It works well, and
 I can see that the content shown by 'cat' is what it is supposed to be. The
 only problem is that, when I read it into bytes[] and print it out to stdout,
 it is NOT what it should be. Thus, I cannot do anything (e.g. comparison)
 except write it directly to another file.

 I guess this problem may be due to the setting of the file format (text or
 binary) or the encoding (e.g. UTF-8). Can someone give me some ideas?


 --Gang



 - Original Message 
 From: Gang Luo lgpub...@yahoo.com.cn
 To: common-user@hadoop.apache.org
 Sent: 2009/11/9 (Mon) 11:47:02 AM
 Subject: how to read file in hadoop

 Hi all
 I want to use the HDFS IO API to read a result file of the previous mapreduce
 job. But what I read is not what is in that file; that is, the content I print
 to stdout is different from what I get from the console by the command 'cat'. I
 guess there may be some problem with the file format (binary or text). Can
 anyone give me some hints?


 Gang Luo















stdout logs ?

2009-11-10 Thread Siddu
Hi all


In

src/contrib/data_join/src/java/org/apache/hadoop/contrib/utils/join/DataJoinJob.java

i found a couple of println statements (shown below) which are getting executed

when submitted for a job.

I am not sure to which stdout they are printing.

I searched in logs/* but didn't find it.

Can somebody please tell me where they are logged?

btw i am running this job on a cluster which has only one node.

try {
  running = jc.submitJob(job);
  JobID jobId = running.getID();
  System.out.println("Job " + jobId + " is submitted");
  while (!running.isComplete()) {
    System.out.println("Job " + jobId + " is still running.");
    try {
      Thread.sleep(6);
    } catch (InterruptedException e) {


Re: Error with replication and namespaceID

2009-11-10 Thread Edmund Kohlwey

Hi Ray,
You'll probably find that even though the name node starts, it doesn't 
have any data nodes and is completely empty.


Whenever hadoop creates a new filesystem, it assigns a large random 
number to it to prevent you from mixing datanodes from different 
filesystems by accident. When you reformat the name node its FS has one 
ID, but your data nodes still have chunks of the old FS with a different 
ID and so will refuse to connect to the namenode. You need to make sure 
these are cleaned up before reformatting. You can do it just by deleting 
the data node directory, although there's probably a more official way 
to do it.



On 11/10/09 11:01 AM, Raymond Jennings III wrote:

On the actual datanodes I see the following exception:  I am not sure what the 
namespaceID is or how to sync them.  Thanks for any advice!



/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pingo-3.poly.edu/128.238.55.33
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 
810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
/
2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: 
namenode namespaceID = 1016244663; datanode namespaceID = 1687029285
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)


--- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote:

 From: Boris Shkolnik bo...@yahoo-inc.com
 Subject: Re: newbie question - error with replication
 To: common-user@hadoop.apache.org
 Date: Monday, November 9, 2009, 5:02 PM
 Make sure you have at least one datanode running.
 Look at the data node log file. (logs/*-datanode-*.log)

 Boris.


 On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com wrote:

  I am trying to resolve an IOException error.  I have a basic setup and
  shortly after running start-dfs.sh I get a:

  error: java.io.IOException: File
  /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated
  to 0 nodes, instead of 1
  java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info
  could only be replicated to 0 nodes, instead of 1

  Any pointers how to resolve this?  Thanks!



Re: stdout logs ?

2009-11-10 Thread Gang Luo
Hi Siddu,
I asked this question a couple of days ago. You should use
your browser to access the jobtracker. Click a job id -> map -> pick a
map task -> click the link in the task log column, and you will see the
output at stdout and stderr. (On the tasktracker node itself, these task
logs end up under logs/userlogs/.)

-Gang




--Original Message-

In

src/contrib/data_join/src/java/org/apache/hadoop/contrib/utils/join/DataJoinJob.java

i found a couple of println statements (shown below )which are getting executed

when submitted for a job .

I am not sure to which stdout they are printing ?

I searched in logs/* but dint find it ?

Can somebody  please tell me where are they logged

btw i am running this job on a cluster which has only one node .

try {
  running = jc.submitJob(job);
  JobID jobId = running.getID();
  System.out.println("Job " + jobId + " is submitted");
  while (!running.isComplete()) {
    System.out.println("Job " + jobId + " is still running.");
    try {
      Thread.sleep(6);
    } catch (InterruptedException e) {





Re: Error with replication and namespaceID

2009-11-10 Thread Raymond Jennings III
Thanks!!!  That worked!  I guess I can edit the number on the datanodes as well 
but if there is an even more official way to resolve this I would be 
interested in hearing about it.

--- On Tue, 11/10/09, Edmund Kohlwey ekohl...@gmail.com wrote:

 From: Edmund Kohlwey ekohl...@gmail.com
 Subject: Re: Error with replication and namespaceID
 To: common-user@hadoop.apache.org
 Date: Tuesday, November 10, 2009, 1:46 PM
 Hi Ray,
 You'll probably find that even though the name node starts, it doesn't
 have any data nodes and is completely empty.

 Whenever hadoop creates a new filesystem, it assigns a large random
 number to it to prevent you from mixing datanodes from different
 filesystems by accident. When you reformat the name node its FS has one
 ID, but your data nodes still have chunks of the old FS with a different
 ID and so will refuse to connect to the namenode. You need to make sure
 these are cleaned up before reformatting. You can do it just by deleting
 the data node directory, although there's probably a more official way
 to do it.


 On 11/10/09 11:01 AM, Raymond Jennings III wrote:
  On the actual datanodes I see the following exception:  I am not sure
  what the namespaceID is or how to sync them.  Thanks for any advice!

  /
  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host = pingo-3.poly.edu/128.238.55.33
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.1
  STARTUP_MSG:   build =
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
  -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
  /
  2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
  java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data:
  namenode namespaceID = 1016244663; datanode namespaceID = 1687029285
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)


  --- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote:

   From: Boris Shkolnik bo...@yahoo-inc.com
   Subject: Re: newbie question - error with replication
   To: common-user@hadoop.apache.org
   Date: Monday, November 9, 2009, 5:02 PM
   Make sure you have at least one datanode running.
   Look at the data node log file. (logs/*-datanode-*.log)

   Boris.


   On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com wrote:

    I am trying to resolve an IOException error.  I have a basic setup and
    shortly after running start-dfs.sh I get a:

    error: java.io.IOException: File
    /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated
    to 0 nodes, instead of 1
    java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info
    could only be replicated to 0 nodes, instead of 1

    Any pointers how to resolve this?  Thanks!
 





Should I upgrade from 0.18.3 to the latest 0.20.1?

2009-11-10 Thread Mark Kerzner
Hi,

I've been working on my project for about a year, and I decided to upgrade
from 0.18.3 (which was stable and already old even back then). I have
started, but I see that many classes have changed, many are deprecated, and
I need to re-write some code. Is it worth it? What are the advantages of
doing this? Other areas of concern are:

   - Will Amazon EMR work with the latest Hadoop?
   - What about Cloudera distribution or Yahoo distribution?

Thank you,
Mark


error setting up hdfs?

2009-11-10 Thread zenkalia
had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls
ls: Cannot access .: No such file or directory.

anyone else get this one?  i started changing settings on my box to get all
of my cores working, but immediately hit this error.  since then i started
from scratch and have hit this error again.  what am i missing?


Re: error setting up hdfs?

2009-11-10 Thread Stephen Watt
You need to specify a path. Try  bin/hadoop dfs -ls / 

Steve Watt



From:
zenkalia zenka...@gmail.com
To:
core-u...@hadoop.apache.org
Date:
11/10/2009 03:04 PM
Subject:
error setting up hdfs?



had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls
ls: Cannot access .: No such file or directory.

anyone else get this one?  i started changing settings on my box to get 
all
of my cores working, but immediately hit this error.  since then i started
from scratch and have hit this error again.  what am i missing?




Re: Automate EC2 cluster termination

2009-11-10 Thread Hitchcock, Andrew
Hi John,

Have you considered Amazon Elastic MapReduce? (Disclaimer: I work on Elastic 
MapReduce)

http://aws.amazon.com/elasticmapreduce/

It waits for your job to finish and then automatically shuts down the cluster. 
With a simple command like:

 elastic-mapreduce --create --num-instances 10 --jar s3://mybucket/my.jar 
--args s3://mybucket/input/,s3://mybucket/output/

It will automatically create a cluster, run your jar, and then shut everything 
down. Elastic MapReduce costs a little bit more than just plain EC2, but if it 
prevents your cluster from running longer than necessary, you might save money.

Andrew


On 11/10/09 6:13 AM, John Clarke clarke...@gmail.com wrote:

Hi,

I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great
but I want to automate it a bit more.

I want to be able to:
- start cluster
- copy data from S3 to the DFS
- run the job
- copy result data from DFS to S3
- verify it all copied ok
- shutdown the cluster.


I guess the hardest part is reliably detecting when a job is complete. I've
seen solutions that provide a time based shutdown but they are not suitable
as our jobs vary in time.

Has anyone made a script that does this already? I'm using the Cloudera
python scripts to start/terminate my cluster.

Thanks,
John



Re: Hadoop NameNode not starting up

2009-11-10 Thread Stephen Watt
You need to go to your logs directory and have a look at what is going on 
in the namenode log. What version are you using ? 

I'm going to take a guess at your issue here and say that you used 
/tmp as a path for some of your hadoop conf settings and you have rebooted 
lately. The /tmp dir is wiped out on reboot.

Kind regards
Steve Watt



From:
Kaushal Amin kaushala...@gmail.com
To:
common-user@hadoop.apache.org
Date:
11/10/2009 08:47 AM
Subject:
Hadoop NameNode not starting up



I am running Hadoop on single server. The issue I am running into is that
start-all.sh script is not starting up NameNode.

Only way I can start NameNode is by formatting it and I end up losing data
in HDFS.

 

Does anyone have solution to this issue?

 

Kaushal

 





Re: Hadoop NameNode not starting up

2009-11-10 Thread Sagar

did you format it for the first time?
another quick way to figure it out
is
${HADOOP_HOME}/bin/hadoop namenode

see what error it gives

-Sagar

Stephen Watt wrote:
You need to go to your logs directory and have a look at what is going on 
in the namenode log. What version are you using ? 

I'm going to take a guess at your issue here and say that you used the 
/tmp as a path for some of your hadoop conf settings and you have rebooted 
lately. The /tmp dir is wiped out on reboot.


Kind regards
Steve Watt



From:
Kaushal Amin kaushala...@gmail.com
To:
common-user@hadoop.apache.org
Date:
11/10/2009 08:47 AM
Subject:
Hadoop NameNode not starting up



I am running Hadoop on single server. The issue I am running into is that
start-all.sh script is not starting up NameNode.

Only way I can start NameNode is by formatting it and I end up losing data
in HDFS.

 


Does anyone have solution to this issue?

 


Kaushal

 





  




Re: error setting up hdfs?

2009-11-10 Thread zenkalia
ok, things are working..  i must have forgotten what i did when first
setting up hadoop...

should these responses be considered inconsistent/an error?  hmm.

hadoop dfs -ls
error
hadoop dfs -ls /
irrelevant stuff about the path you're in
hadoop dfs -mkdir lol
works fine
hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup  0 2009-11-10 05:28
/user/hadoop/lol

thanks stephen.
-mike

On Tue, Nov 10, 2009 at 1:19 PM, Stephen Watt sw...@us.ibm.com wrote:

 You need to specify a path. Try  bin/hadoop dfs -ls / 

 Steve Watt



 From:
 zenkalia zenka...@gmail.com
 To:
 core-u...@hadoop.apache.org
 Date:
 11/10/2009 03:04 PM
 Subject:
 error setting up hdfs?



 had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls
 ls: Cannot access .: No such file or directory.

 anyone else get this one?  i started changing settings on my box to get
 all
 of my cores working, but immediately hit this error.  since then i started
 from scratch and have hit this error again.  what am i missing?





Anyone using Hadoop in Austin, Texas ?

2009-11-10 Thread Stephen Watt
Just curious to see if there are any hadoop compatriots around and if 
there are, maybe we could organize a meetup. 

Regards
Steve Watt 


Re: Anyone using Hadoop in Austin, Texas ?

2009-11-10 Thread Mark Kerzner
Me in Houston :)

Mark

On Tue, Nov 10, 2009 at 3:32 PM, Stephen Watt sw...@us.ibm.com wrote:

 Just curious to see if there are any hadoop compatriots around and if
 there are, maybe we could organize a meetup.

 Regards
 Steve Watt



Re: error setting up hdfs?

2009-11-10 Thread Aaron Kimball
You don't need to specify a path. If you don't specify a path argument for
ls, then it uses your home directory in HDFS (/user/yourusernamehere).
When you first started HDFS, /user/hadoop didn't exist, so 'hadoop fs -ls'
became 'hadoop fs -ls /user/hadoop', which reported directory not found. When
you mkdir'd 'lol', you were actually effectively doing mkdir -p /user/hadoop/lol,
so it then created your home directory underneath as part of that.
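
A small sketch of the same resolution using the FileSystem API directly;
nothing here beyond the stock calls, with the output formatting just for
illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowHome {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Relative paths (like a bare 'hadoop dfs -ls') resolve against the
    // working directory, which defaults to /user/<username> in HDFS.
    System.out.println("home    = " + fs.getHomeDirectory());
    System.out.println("working = " + fs.getWorkingDirectory());

    // Roughly what 'hadoop dfs -ls' does once the home directory exists.
    FileStatus[] listing = fs.listStatus(fs.getWorkingDirectory());
    if (listing == null) {
      System.out.println("ls: Cannot access .: No such file or directory.");
    } else {
      for (FileStatus stat : listing) {
        System.out.println(stat.getPath());
      }
    }
  }
}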

- Aaron

On Tue, Nov 10, 2009 at 1:30 PM, zenkalia zenka...@gmail.com wrote:

 ok, things are working..  i must have forgotten what i did when first
 setting up hadoop...

 should these responses be considered inconsistent/an error?  hmm.

 hadoop dfs -ls
 error
 hadoop dfs -ls /
 irrelevant stuff about the path you're in
 hadoop dfs -mkdir lol
 works fine
 hadoop dfs -ls
 Found 1 items
 drwxr-xr-x   - hadoop supergroup  0 2009-11-10 05:28
 /user/hadoop/lol

 thanks stephen.
 -mike

 On Tue, Nov 10, 2009 at 1:19 PM, Stephen Watt sw...@us.ibm.com wrote:

  You need to specify a path. Try  bin/hadoop dfs -ls / 
 
  Steve Watt
 
 
 
  From:
  zenkalia zenka...@gmail.com
  To:
  core-u...@hadoop.apache.org
  Date:
  11/10/2009 03:04 PM
  Subject:
  error setting up hdfs?
 
 
 
  had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls
  ls: Cannot access .: No such file or directory.
 
  anyone else get this one?  i started changing settings on my box to get
  all
  of my cores working, but immediately hit this error.  since then i
 started
  from scratch and have hit this error again.  what am i missing?
 
 
 



Re: Lucene + Hadoop

2009-11-10 Thread Otis Gospodnetic
I think that sounds right.
I believe that's what I did when I implemented this type of functionality for 
http://simpy.com/

I'm not sure why this is a Hadoop thing, though.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Hrishikesh Agashe hrishikesh_aga...@persistent.co.in
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Tue, November 10, 2009 4:56:33 AM
 Subject: Lucene + Hadoop
 
 Hi,
 
 I am trying to use Hadoop for Lucene index creation. I have to create 
 multiple 
 indexes based on contents of the files (i.e. if author is hrishikesh, it 
 should be added to a index for hrishikesh. There has to be a separate index 
 for every author). For this, I am keeping multiple IndexWriter open for every 
 author and maintaining them in a hashmap in map() function. I parse incoming 
 file and if I see author is one for which I already have opened a 
 IndexWriter, I 
 just add this file in that index, else I create a new IndesWriter for new 
 author. As authors might run into thousands, I am closing IndexWriter and 
 clearing hashmap once it reaches a certain threshold and starting all over 
 again. There is no reduced function.
 
 Does this logic sound correct? Is there any other way of implementing this 
 requirement?
 
 --Hrishi
 
 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which is the 
 property of Persistent Systems Ltd. It is intended only for the use of the 
 individual or entity to which it is addressed. If you are not the intended 
 recipient, you are not authorized to read, retain, copy, print, distribute or 
 use this message. If you have received this communication in error, please 
 notify the sender and delete all copies of this message. Persistent Systems 
 Ltd. 
 does not accept any liability for virus infected mails.



Re: Hadoop User Group Maryland/DC Area

2009-11-10 Thread Jeff Hammerbacher
Hey Abhi,

Check out http://www.meetup.com/Hadoop-DC/.

Regards,
Jeff

On Tue, Nov 10, 2009 at 9:26 AM, Abhishek Pratap abhishek@gmail.com wrote:

 Hi Guys

 Just wondering if there is any Hadoop group functioning in the Maryland/DC
 area. I would love to be a part and learn few things along the way.

 Cheers,
 -Abhi



Re: Should I upgrade from 0.18.3 to the latest 0.20.1?

2009-11-10 Thread Edmund Kohlwey
The new API in 0.20.x is likely not what you'll see in the final Hadoop 
1.0 release, which I've heard some people forecast within the next 18 
months or so (we'll see). There will likely be a 0.21.x series, and then 
the final release.


That having been said, it's much more similar to what you'll see in the 
final release. Depending on how complex your jobs are, you may see minor 
or no changes in the final release, or you may see dramatic ones. I 
think (someone correct me if I'm wrong) the basic map and reduce 
abstract classes are just about set in stone, but if you're using other 
stuff like file formats, custom splits, etc. then you may see a lot of 
differences. I've also noticed a lot of changes in how the job and task 
trackers work, even in the current trunk. There's also some interesting 
work being done by yahoo on pipelining MR jobs, which will not be in any 
0.20.x release.
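
To make the flavor of the change concrete, here is a minimal side-by-side
sketch of a trivial mapper in the old interface-based API and in the new
0.20 abstract-class API; both just emit each input line with a count of one:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old (0.18.x) style: an interface, with OutputCollector and Reporter arguments.
class OldStyleMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    output.collect(value, new IntWritable(1));
  }
}

// New (0.20.x) style: an abstract class, with a single Context argument that
// also replaces the old configure()/close() lifecycle via setup()/cleanup().
class NewStyleMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}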


The other thing about 0.20.x is that a lot of the old API (like joins, 
etc.) has not been updated, so your application may be a hodgepodge 
patchwork of the two APIs.


Are there any portions of the new API which are particularly attractive 
to you? That might help people suggest whether or not you should switch 
to satisfy that need. If you don't have any needs particular to the 
0.20.x API then there's probably little reason to switch.


If you do upgrade to 0.20.1, make sure to get the cloudera or yahoo 
distributions. The current stable (0.20.1) release on the Apache page 
is very buggy.


On 11/10/09 3:30 PM, Mark Kerzner wrote:

Hi,

I've been working on my project for about a year, and I decided to upgrade
from 0.18.3 (which was stable and already old even back then). I have
started, but I see that many classes have changed, many are deprecated, and
I need to re-write some code. Is it worth it? What are the advantages of
doing this? Other areas of concern are:

- Will Amazon EMR work with the latest Hadoop?
- What about Cloudera distribution or Yahoo distribution?

Thank you,
Mark

   




Re: NameNode/DataNode JobTracker/TaskTracker

2009-11-10 Thread Todd Lipcon
On Mon, Nov 9, 2009 at 1:04 PM, John Martyniak j...@beforedawnsolutions.com
 wrote:

 Thanks Todd.

 I wasn't sure if that is possible.  But you pointed out an important point
 and that is it is just NN and JT that would run remotely.

 So in order to do this would I just install the complete hadoop instance on
 each one.  And then would they be configed as masters?

 Or should NameNode and JobTracker run on the same machine?  So there would
 be one master.


Either way. On all clusters but the largest, the NN and JT are not
significant users of CPU. On medium size clusters they can start to use up
multiple GBs of RAM. If you're using less than 30 nodes you can *probably*
get by with one machine for both; I say probably because it depends on not
just your total capacity but also the number of files you have. There are
some rough sizing estimates if you google the archives for CompressedOops
I think - someone did some measurements of the NN's memory requirements.


 So when I start the cluster would I start it from the NN/JT machine.  Could
 it also be started from any of the other cluster members.


It doesn't matter - Hadoop itself doesn't use SSH or anything. The daemons
just all have to be started somehow. If you're using the Cloudera
distribution with RPM/Deb you can use init scripts. If you prefer shell
scripts and ssh you can use the provided start-all scripts, your own
scripts, or something like pdsh or cap shell. If you're a masochist you can
log into each node individually and start the daemons by hand. I do not
recommend this last option :)


 sorry for all of the seemingly basic questions, but want to get it right
 the first time:)


Sure thing- we're here to help.

-Todd




 On Nov 9, 2009, at 1:11 PM, Todd Lipcon wrote:

  On Mon, Nov 9, 2009 at 7:20 AM, John Martyniak 
 j...@beforedawnsolutions.com

 wrote:



 Can the NameNode/DataNode  JobTracker/TaskTracker run on a server that
 isn't part of the cluster meaning I would like to run it on a machine
 that
 wouldn't participate in the processing of data, and wouldn't participate
 in
 the HDFS data sharing, and would solely focus on the NameNode/DataNode 
 JobTracker/TaskTracker tasks.


  Yes, running the NN and the JT on servers that don't also run TT/DN is
 very
 common and recommended for clusters of more than maybe 5 nodes.

 -Todd





Re: java.io.IOException: Could not obtain block:

2009-11-10 Thread Edmund Kohlwey

I've not encountered an error like this, but here's some suggestions:

1. Try to make sure that your two node cluster is setup correctly. 
Querying the web interface, using any of the included dfs utils (eg. 
hadoop dfs -ls), or looking in your log directory may yield more useful 
stack traces or errors.


2. Open up the source and check out the code around the stack trace. 
This sucks, but hadoop is actually pretty easy to surf through in 
Eclipse, and most classes are kept within a reasonable number of lines 
of code and fairly readable.


3. Rip out the parts of Nutch you need and drop them in your project, 
and forget about 0.19. This isn't ideal, but you have to remember that 
this whole ecosystem is still forming and sometimes it makes sense to 
rip stuff out and transplant it into your project rather than depending 
on 2-3 classes from a project which you otherwise don't use.


On 11/10/09 11:32 AM, John Martyniak wrote:

Hello everyone,

I am getting this error java.io.IOException: Could not obtain block:, 
when running on my new cluster.  When I ran the same job on the single 
node it worked perfectly, I then added in the second node, and receive 
this error.  I was running the grep sample job.


I am running Hadoop 0.19.2, because of a dependency on Nutch 
(Eventhough this was not a Nutch job).  I am not running HBase, the 
version of Java is OpenJDK 1.6.0.


Does anybody have any ideas?

Thanks in advance,

-John





Re: java.io.IOException: Could not obtain block:

2009-11-10 Thread John Martyniak

Edmund,

Thanks for the advice.  It turns out that it was the firewall running  
on the second cluster node.


So I stopped that and all is working correctly.  Now that I have the  
second node working the way that it is supposed to, I'm probably going to  
bring another couple of nodes online.


Wish me luck:)

-John

On Nov 10, 2009, at 9:30 PM, Edmund Kohlwey wrote:


I've not encountered an error like this, but here's some suggestions:

1. Try to make sure that your two node cluster is setup correctly.  
Querying the web interface, using any of the included dfs utils (eg.  
hadoop dfs -ls), or looking in your log directory may yield more  
useful stack traces or errors.


2. Open up the source and check out the code around the stack trace.  
This sucks, but hadoop is actually pretty easy to surf through in  
Eclipse, and most classes are kept within a reasonable number of  
lines of code and fairly readable.


3. Rip out the parts of Nutch you need and drop them in your  
project, and forget about 0.19. This isn't ideal, but you have to  
remember that this whole ecosystem is still forming and sometimes it  
makes sense to rip stuff out and transplant it into your project  
rather than depending on 2-3 classes from a project which you  
otherwise don't use.


On 11/10/09 11:32 AM, John Martyniak wrote:

Hello everyone,

I am getting this error java.io.IOException: Could not obtain  
block:, when running on my new cluster.  When I ran the same job on  
the single node it worked perfectly, I then added in the second  
node, and receive this error.  I was running the grep sample job.


I am running Hadoop 0.19.2, because of a dependency on Nutch  
(Eventhough this was not a Nutch job).  I am not running HBase, the  
version of Java is OpenJDK 1.6.0.


Does anybody have any ideas?

Thanks in advance,

-John