Re: Debugging in Hadoop

2009-01-26 Thread Amareshwari Sriramadasu

patektek wrote:

Hello list, I am trying to add some functionality to Hadoop-core and I am
having serious issues
debugging it. I have searched in the list archive and still have not been
able to resolve the issues.

Simple question:
If I want to insert "LOG.INFO()" statements in Hadoop code is not that as
simple as  modifying
log4j.properties file to include the class which has the statements. For
example, if I want to
print out the LOG.info("I am here!") statements in MapTask. class
I would add to the lo4j.properites file the following line:


  
LOG.info statements in MapTask will be shown in syslog in task logs.  
The directory is ${hadoop.log.dir}/userlogs/.

The same can be browsed on the web ui of the task.

-Amareshwari

# Custom Logging levels
.
.
.
log4j.logger.org.apache.hadoop.mapred.MapTask=INFO

This approach is clearly not working for me.
What am I missing?

Thank you,
patektek


LOG.info statements in MapTask will be shown in syslog in the task logs.
The directory is ${hadoop.log.dir}/userlogs/.

The same can be browsed on the web UI of the task.

-Amareshwari
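
For reference, a minimal sketch of the pattern under discussion, using the
Commons Logging API that Hadoop itself uses (the class and method names here
are placeholders; Hadoop classes such as MapTask already declare a LOG field
like this):

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class MyTaskDebug {
      // Hadoop classes such as MapTask already declare a field like this.
      private static final Log LOG = LogFactory.getLog(MyTaskDebug.class);

      void doWork() {
        // With log4j.logger.org.apache.hadoop.mapred.MapTask=INFO (or lower)
        // in conf/log4j.properties, a statement like this in MapTask shows up
        // in the syslog file under ${hadoop.log.dir}/userlogs/<task-attempt>/.
        LOG.info("I am here!");
      }
    }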




Re: Netbeans/Eclipse plugin

2009-01-26 Thread Amit k. Saha
On Tue, Jan 27, 2009 at 2:52 AM, Aaron Kimball  wrote:
> The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/)
> currently is inoperable. The DFS viewer works, but the job submission code
> is broken.

I have started conversation with 3 other community members to work on
the NetBeans plugin. You can track the progress at
http://wiki.netbeans.org/Nbhadoop.

Best,
Amit


>
> - Aaron
>
> On Sun, Jan 25, 2009 at 9:07 PM, Amit k. Saha  wrote:
>
>> On Sun, Jan 25, 2009 at 9:32 PM, Edward Capriolo 
>> wrote:
>> > On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar 
>> wrote:
>> >> Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I
>> want
>> >> to make plugin for netbeans
>> >>
>> >> http://vinayakkatkar.wordpress.com
>> >> --
>> >> Vinayak Katkar
>> >> Sun Campus Ambassador
>> >> Sun Microsytems,India
>> >> COEP
>> >>
>> >
>> > There is an ecplipse plugin.
>> http://www.alphaworks.ibm.com/tech/mapreducetools
>> >
>> > Seems like some work is being done on netbeans
>> > https://nbhadoop.dev.java.net/
>>
>> I started this project. But well, its caught up in the requirements
>> gathering phase.
>>
>> @ Vinayak,
>>
>> Lets take this offline and discuss. What do you think?
>>
>>
>> Thanks,
>> Amit
>>
>> >
>> > The world needs more netbeans love.
>> >
>>
>> Definitely :-)
>>
>>
>> --
>> Amit Kumar Saha
>> http://amitksaha.blogspot.com
>> http://amitsaha.in.googlepages.com/
>> *Bangalore Open Java Users Group*:http:www.bojug.in
>>
>



-- 
Amit Kumar Saha
http://amitksaha.blogspot.com
http://amitsaha.in.googlepages.com/
*Bangalore Open Java Users Group*:http:www.bojug.in


Re: Zeroconf for hadoop

2009-01-26 Thread Vadim Zaliva
On Mon, Jan 26, 2009 at 11:22, Edward Capriolo  wrote:
> Zeroconf is more focused on simplicity than security. One of the
> original problems that may have been fixed is that any program can
> announce any service, i.e., my laptop can announce that it is the DNS for
> google.com, etc.

I see two distinct tasks here:

1. Discovery
2. Authorization

Zeroconf would allow easy discovery of potential new nodes. Then if
they are configured with proper security credentials they could be
used in a cluster.

Vadim


DBOutputFormat and auto-generated keys

2009-01-26 Thread Vadim Zaliva
Is it possible to obtain auto-generated IDs when writing data using
DBOutputFormat?

For example, is it possible to write a Mapper which stores records in the DB
and returns the auto-generated IDs of these records?

Let me explain what I am trying to achieve:

I have data like this


which I would like to store in normalized form in two tables. The first
table will store keys (strings). Each key will have a unique int id
auto-generated by MySQL.

The second table will have (key_id, value) pairs, key_id being a foreign
key pointing to the first table.

Sincerely,
Vadim
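
DBOutputFormat does not appear to hand the generated keys back to the map
task, so one workaround is to do the insert with plain JDBC inside the mapper
and ask the driver for the generated key. A rough sketch (the table and
column names are made up for illustration; the Connection would typically be
opened in the mapper's configure() and closed in close()):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class KeyInsert {
      // Inserts one key string and returns the AUTO_INCREMENT id MySQL assigned.
      public static long insertKey(Connection conn, String key) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO keys_table (key_string) VALUES (?)",
            Statement.RETURN_GENERATED_KEYS);
        ps.setString(1, key);
        ps.executeUpdate();
        ResultSet rs = ps.getGeneratedKeys();
        long id = rs.next() ? rs.getLong(1) : -1;
        rs.close();
        ps.close();
        return id;  // use as key_id when writing the (key_id, value) rows
      }
    }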


files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-26 Thread Yuanyuan Tian


Hi,

I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on
http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck
and everything seems fine. All the files can be listed in hdfs and the sizes
are also correct. But when a mapreduce job tries to read the files as
input, the following error messages are returned for some of the files:

java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131
file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
 at org.apache.hadoop.hdfs.DFSClient
$DFSInputStream.chooseDataNode(DFSClient.java:1708)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo
(DFSClient.java:1536)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read
(DFSClient.java:1663)
 at java.io.DataInputStream.read(DataInputStream.java:150)
 at java.io.ObjectInputStream$PeekInputStream.read
(ObjectInputStream.java:2283)
 at java.io.ObjectInputStream$PeekInputStream.readFully
(ObjectInputStream.java:2296)
 at java.io.ObjectInputStream$BlockDataInputStream.readShort
(ObjectInputStream.java:2767)
 at java.io.ObjectInputStream.readStreamHeader
(ObjectInputStream.java:798)
 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
 at
emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)

 at
emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader
(EmailContentFormat.java:20)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)

I also tried to browse these files through the HDFS web interface, and
java.io.EOFException is returned.

Is there any way to recover the files?

Thanks very much,

YY

Re: Mapred job parallelism

2009-01-26 Thread Aaron Kimball
Indeed, you will need to enable the Fair Scheduler or Capacity Scheduler
(which are both in 0.19) to do this. mapred.map.tasks is more a hint than
anything else -- if you have more files to map than you set this value to,
it will use more tasks than you configured the job to. The newer schedulers
will ensure that each job's many map tasks are only using a portion of the
available slots.
- Aaron

On Mon, Jan 26, 2009 at 1:43 PM, jason hadoop wrote:

> I believe that the schedule code in 0.19.0 has a framework for this, but I
> haven't dug into it in detail yet.
>
> http://hadoop.apache.org/core/docs/r0.19.0/capacity_scheduler.html
>
> From what I gather you would set up 2 queues, each with guaranteed access
> to
> 1/2 of the cluster
> Then you submit your jobs to alternate queues.
>
> This is not ideal as you have to balance what queue you submit jobs to, to
> ensure that there is some depth.
>
>
> On Mon, Jan 26, 2009 at 1:30 PM, Sagar Naik  wrote:
>
> > Hi Guys,
> >
> > I was trying to setup a cluster so that two jobs can run simultaneously.
> >
> > The conf :
> > number of nodes : 4(say)
> > mapred.tasktracker.map.tasks.maximum=2
> >
> >
> > and in the joblClient
> > mapred.map.tasks=4 (# of nodes)
> >
> >
> > I also have a condition, that each job should have only one map-task per
> > node
> >
> > In short, created 8 map slots and set the number of mappers to 4.
> > So now, we have two jobs running simultaneously
> >
> > However, I realized that, if a tasktracker happens to die, potentially, I
> > will have 2 map-tasks running on a node
> >
> >
> > Setting mapred.tasktracker.map.tasks.maximum=1 in Jobclient has no
> effect.
> > It is tasktracker property and cant be changed per job
> >
> > Any ideas on how to have 2 jobs running simultaneously ?
> >
> >
> > -Sagar
> >
> >
> >
> >
> >
> >
> >
>
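
Once the scheduler's queues are defined, pointing a job at a particular queue
from the job client is just a JobConf property. A sketch under that
assumption (queue name "queueA" is a placeholder; valid names are whatever
mapred.queue.names declares on the cluster):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class QueueSubmit {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(QueueSubmit.class);
        conf.setJobName("queued-job");
        // Submit this job to one of the configured queues.
        conf.set("mapred.job.queue.name", "queueA");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);  // identity map/reduce by default
      }
    }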


Re: Mapred job parallelism

2009-01-26 Thread jason hadoop
I believe that the schedule code in 0.19.0 has a framework for this, but I
haven't dug into it in detail yet.

http://hadoop.apache.org/core/docs/r0.19.0/capacity_scheduler.html

>From what I gather you would set up 2 queues, each with guaranteed access to
1/2 of the cluster
Then you submit your jobs to alternate queues.

This is not ideal as you have to balance what queue you submit jobs to, to
ensure that there is some depth.


On Mon, Jan 26, 2009 at 1:30 PM, Sagar Naik  wrote:

> Hi Guys,
>
> I was trying to setup a cluster so that two jobs can run simultaneously.
>
> The conf :
> number of nodes : 4(say)
> mapred.tasktracker.map.tasks.maximum=2
>
>
> and in the joblClient
> mapred.map.tasks=4 (# of nodes)
>
>
> I also have a condition, that each job should have only one map-task per
> node
>
> In short, created 8 map slots and set the number of mappers to 4.
> So now, we have two jobs running simultaneously
>
> However, I realized that, if a tasktracker happens to die, potentially, I
> will have 2 map-tasks running on a node
>
>
> Setting mapred.tasktracker.map.tasks.maximum=1 in Jobclient has no effect.
> It is tasktracker property and cant be changed per job
>
> Any ideas on how to have 2 jobs running simultaneously ?
>
>
> -Sagar
>
>
>
>
>
>
>


Mapred job parallelism

2009-01-26 Thread Sagar Naik

Hi Guys,

I was trying to setup a cluster so that two jobs can run simultaneously.

The conf :
number of nodes : 4(say)
mapred.tasktracker.map.tasks.maximum=2


and in the JobClient
mapred.map.tasks=4 (# of nodes)


I also have a condition, that each job should have only one map-task per 
node


In short, created 8 map slots and set the number of mappers to 4.
So now, we have two jobs running simultaneously

However, I realized that, if a tasktracker happens to die, potentially, 
I will have 2 map-tasks running on a node



Setting mapred.tasktracker.map.tasks.maximum=1 in the JobClient has no 
effect. It is a tasktracker property and can't be changed per job.


Any ideas on how to have 2 jobs running simultaneously ?


-Sagar








Re: Netbeans/Eclipse plugin

2009-01-26 Thread Aaron Kimball
The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/)
currently is inoperable. The DFS viewer works, but the job submission code
is broken.

- Aaron

On Sun, Jan 25, 2009 at 9:07 PM, Amit k. Saha  wrote:

> On Sun, Jan 25, 2009 at 9:32 PM, Edward Capriolo 
> wrote:
> > On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar 
> wrote:
> >> Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I
> want
> >> to make plugin for netbeans
> >>
> >> http://vinayakkatkar.wordpress.com
> >> --
> >> Vinayak Katkar
> >> Sun Campus Ambassador
> >> Sun Microsytems,India
> >> COEP
> >>
> >
> > There is an ecplipse plugin.
> http://www.alphaworks.ibm.com/tech/mapreducetools
> >
> > Seems like some work is being done on netbeans
> > https://nbhadoop.dev.java.net/
>
> I started this project. But well, its caught up in the requirements
> gathering phase.
>
> @ Vinayak,
>
> Lets take this offline and discuss. What do you think?
>
>
> Thanks,
> Amit
>
> >
> > The world needs more netbeans love.
> >
>
> Definitely :-)
>
>
> --
> Amit Kumar Saha
> http://amitksaha.blogspot.com
> http://amitsaha.in.googlepages.com/
> *Bangalore Open Java Users Group*:http:www.bojug.in
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Jason, this is awesome, thank you.
By the way, is there a book or manual with "best practices?"

On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote:

> Sequence files rock, and you can use the
> *
> bin/hadoop dfs -text FILENAME* command line tool to get a toString level
> unpacking of the sequence file key,value pairs.
>
> If you provide your own key or value classes, you will need to implement a
> toString method to get some use out of this. Also, your class path will
> need
> to include the jars with your custom key/value classes.
>
> HADOOP_CLASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME*
>
>
> On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner 
> wrote:
>
> > Thank you, Doug, then all is clear in my head.
> > Mark
> >
> > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting 
> wrote:
> >
> > > Mark Kerzner wrote:
> > >
> > >> Okay, I am convinced. I only noticed that Doug, the originator, was
> not
> > >> happy about it - but in open source one has to give up control
> > sometimes.
> > >>
> > >
> > > I think perhaps you misunderstood my remarks.  My point was that, if
> you
> > > looked to Nutch's Content class for an example, it is, for historical
> > > reasons, somewhat more complicated than it needs to be and is thus a
> less
> > > than perfect example.  But using SequenceFile to store web content is
> > > certainly a best practice and I did not mean to imply otherwise.
> > >
> > > Doug
> > >
> >
>


Re: What happens in HDFS DataNode recovery?

2009-01-26 Thread Aaron Kimball
Also, see the balancer tool that comes with Hadoop. This background process
should be run periodically (Every week or so?) to make sure that data's
evenly distributed.

http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#Rebalancer

- Aaron

On Sat, Jan 24, 2009 at 7:40 PM, jason hadoop wrote:

> The blocks will be invalidated on the returned to service datanode.
> If you want to save your namenode and network a lot of work, wipe the hdfs
> block storage directory before returning the Datanode to service.
> dfs.data.dir will be the directory, most likely the value is
> ${hadoop.tmp.dir}/dfs/data
>
> Jason - Ex Attributor
>
> On Sat, Jan 24, 2009 at 6:19 PM, C G  wrote:
>
> > Hi All:
> >
> > I elected to take a node out of one of our grids for service.  Naturally
> > HDFS recognized the loss of the DataNode and did the right stuff, fixing
> > replication issues and ultimately delivering a clean file system.
> >
> > So now the node I removed is ready to go back in service.  When I return
> it
> > to service a bunch of files will suddenly have a replication of 4 instead
> of
> > 3.  My questions:
> >
> > 1.  Will HDFS delete a copy of the data to bring replication back to 3?
> > 2.  If (1) above is  yes, will it remove the copy by deleting from other
> > nodes, or will it remove files from the returned node, or both?
> >
> > The motivation for asking the questions are that I have a file system
> which
> > is extremely unbalanced - we recently doubled the size of the grid with a
> > few dozen terabytes already stored on the existing nodes.  I am wondering
> if
> > an easy way to restore some sense of balance is to cycle through the old
> > nodes, removing each one from service for several hours and then return
> it
> > to service.
> >
> > Thoughts?
> >
> > Thanks in Advance,
> > C G
> >
> >
> >
> >
> >
> >
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
Sequence files rock, and you can use the
*
bin/hadoop dfs -text FILENAME* command line tool to get a toString level
unpacking of the sequence file key,value pairs.

If you provide your own key or value classes, you will need to implement a
toString method to get some use out of this. Also, your class path will need
to include the jars with your custom key/value classes.

HADOOP_CLASSPATH="myjar1:myjar2..." *bin/hadoop dfs -text FILENAME*


On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner  wrote:

> Thank you, Doug, then all is clear in my head.
> Mark
>
> On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting  wrote:
>
> > Mark Kerzner wrote:
> >
> >> Okay, I am convinced. I only noticed that Doug, the originator, was not
> >> happy about it - but in open source one has to give up control
> sometimes.
> >>
> >
> > I think perhaps you misunderstood my remarks.  My point was that, if you
> > looked to Nutch's Content class for an example, it is, for historical
> > reasons, somewhat more complicated than it needs to be and is thus a less
> > than perfect example.  But using SequenceFile to store web content is
> > certainly a best practice and I did not mean to imply otherwise.
> >
> > Doug
> >
>
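
For doing the same thing programmatically rather than with -text, a rough
sketch of iterating over a SequenceFile with the 0.18/0.19 API (the input
path is taken from the command line; nothing here is specific to any
particular key/value classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class DumpSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
            ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
          // Like -text, this relies on the key/value classes' toString().
          System.out.println(key + "\t" + value);
        }
        reader.close();
      }
    }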


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Thank you, Doug, then all is clear in my head.
Mark

On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting  wrote:

> Mark Kerzner wrote:
>
>> Okay, I am convinced. I only noticed that Doug, the originator, was not
>> happy about it - but in open source one has to give up control sometimes.
>>
>
> I think perhaps you misunderstood my remarks.  My point was that, if you
> looked to Nutch's Content class for an example, it is, for historical
> reasons, somewhat more complicated than it needs to be and is thus a less
> than perfect example.  But using SequenceFile to store web content is
> certainly a best practice and I did not mean to imply otherwise.
>
> Doug
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Mark Kerzner wrote:

Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.


I think perhaps you misunderstood my remarks.  My point was that, if you 
looked to Nutch's Content class for an example, it is, for historical 
reasons, somewhat more complicated than it needs to be and is thus a 
less than perfect example.  But using SequenceFile to store web content 
is certainly a best practice and I did not mean to imply otherwise.


Doug


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.
Thank you,
Mark

On Mon, Jan 26, 2009 at 2:36 PM, Andy Liu  wrote:

> SequenceFile supports transparent block-level compression out of the box,
> so
> you don't have to compress data in your code.
>
> Most the time, compression not only saves disk space but improves
> performance because there's less data to write.
>
> Andy
>
> On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner  >wrote:
>
> > Doug,
> > SequenceFile looks like a perfect candidate to use in my project, but are
> > you saying that I better use uncompressed data if I am not interested in
> > saving disk space?
> >
> > Thank you,
> > Mark
> >
> > On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting 
> wrote:
> >
> > > Philip (flip) Kromer wrote:
> > >
> > >> Heretrix ,
> > >> Nutch,
> > >> others use the ARC file format
> > >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> > >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> > >>
> > >
> > > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > > store crawled pages.  The keys of crawl content files are URLs and the
> > > values are:
> > >
> > >
> > >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> > >
> > > I believe that the implementation of this class pre-dates
> SequenceFile's
> > > support for compressed values, so the values are decompressed on
> demand,
> > > which needlessly complicates its implementation and API.  It's
> basically
> > a
> > > Writable that stores binary content plus headers, typically an HTTP
> > > response.
> > >
> > > Doug
> > >
> >
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Andy Liu
SequenceFile supports transparent block-level compression out of the box, so
you don't have to compress data in your code.

Most of the time, compression not only saves disk space but improves
performance because there's less data to write.

Andy

On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I better use uncompressed data if I am not interested in
> saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heretrix ,
> >> Nutch,
> >> others use the ARC file format
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API.  It's basically
> a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
> >
>
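
To illustrate the block compression mentioned above, a minimal sketch of
packing many small documents into one block-compressed SequenceFile with the
0.18/0.19 API (the output path, key and record contents are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // BLOCK compression batches many key/value pairs per compressed block,
        // which is why it is transparent to the code doing the writing.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[0]), Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        byte[] body = "contents of one small file".getBytes();
        writer.append(new Text("doc-0001"), new BytesWritable(body));
        writer.close();
      }
    }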


Re: Hadoop 0.19 over OS X : dfs error

2009-01-26 Thread nitesh bhatia
Well, it's strange: although I changed the default Java environment to Java 6
64-bit, my /Library/Java/Home was still pointing to Java 5. So in
conf/hadoop-env.sh I changed JAVA_HOME to the actual path, i.e.
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home. It's working
now.



On Mon, Jan 26, 2009 at 3:01 PM, Raghu Angadi  wrote:

> nitesh bhatia wrote:
>
>> Thanks. It worked. :) in hadoop-env.sh its required to write exact path
>> for
>> java framework. I changed it to
>> export
>> JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
>> and it started.
>>
>> In hadoop 0.18.2 export JAVA_HOME=/Library/Java/Home is working fine. I am
>> confused why we need to give exact path in 0.19 version.
>>
>
> Most likely reason is that your /Library/Java/Home some how ends up using
> JDK 1.5. 0.19 and up require JDK 1.6.x.
>
> Raghu.
>
>
>  Thankyou
>>
>> --nitesh
>>
>> On Sun, Jan 25, 2009 at 1:52 PM, Joerg Rieger <
>> joerg.rie...@mni.fh-giessen.de> wrote:
>>
>>  Hello,
>>>
>>> what path did you set in conf/hadoop-env.sh?
>>>
>>> Before Hadoop 0.19 I had in hadoop-env.sh:
>>> export JAVA_HOME=/Library/Java/Home
>>>
>>> But that path, despite using java-preferences to change Java versions,
>>> still uses the Java 1.5 version, e.g.:
>>>
>>> $ /Library/Java/Home/bin/java -version
>>> java version "1.5.0_16"
>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b06-284)
>>> Java HotSpot(TM) Client VM (build 1.5.0_16-133, mixed mode, sharing)
>>>
>>> You have to change the setting to:
>>> export
>>> JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
>>>
>>>
>>> Joerg
>>>
>>>
>>> On 25.01.2009, at 00:16, nitesh bhatia wrote:
>>>
>>>  Hi
>>>
 My current default settings are  for java 1.6

 nMac:hadoop-0.19.0 Aryan$ $JAVA_HOME/bin/java -version
 java version "1.6.0_07"
 Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
 Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)


 The system is working fine with Hadoop 0.18.2.

 --nitesh

 On Sun, Jan 25, 2009 at 4:15 AM, Craig Macdonald >>>
> wrote:
>
  Hi,

> I guess that the java on your PATH is different from the setting of
> your
> $JAVA_HOME env variable.
> Try $JAVA_HOME/bin/java -version?
>
> Also, there is a program called Java Preferences on each system for
> changing the default java version used.
>
> Craig
>
>
> nitesh bhatia wrote:
>
>  Hi
>
>> I am trying to setup Hadoop 0.19 on OS X. Current Java Version is
>>
>> java version "1.6.0_07"
>> Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
>> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)
>>
>> When I am trying to format dfs  using "bin/hadoop dfs -format"
>> command.
>> I
>> am
>> getting following errors:
>>
>> nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
>> Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
>> version number in .class file
>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
>>  at
>>
>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
>>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
>>  at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
>>  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
>> Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
>> version number in .class file
>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
>>  at
>>
>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
>>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
>>  at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
>>  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
>>
>>
>> I am not sure why this error is coming.

Re: Hadoop 0.19 over OS X : dfs error

2009-01-26 Thread Raghu Angadi

nitesh bhatia wrote:

Thanks. It worked. :) in hadoop-env.sh its required to write exact path for
java framework. I changed it to
export
JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
and it started.

In hadoop 0.18.2 export JAVA_HOME=/Library/Java/Home is working fine. I am
confused why we need to give exact path in 0.19 version.


The most likely reason is that your /Library/Java/Home somehow ends up 
using JDK 1.5. 0.19 and up require JDK 1.6.x.


Raghu.


Thankyou

--nitesh

On Sun, Jan 25, 2009 at 1:52 PM, Joerg Rieger <
joerg.rie...@mni.fh-giessen.de> wrote:


Hello,

what path did you set in conf/hadoop-env.sh?

Before Hadoop 0.19 I had in hadoop-env.sh:
export JAVA_HOME=/Library/Java/Home

But that path, despite using java-preferences to change Java versions,
still uses the Java 1.5 version, e.g.:

$ /Library/Java/Home/bin/java -version
java version "1.5.0_16"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b06-284)
Java HotSpot(TM) Client VM (build 1.5.0_16-133, mixed mode, sharing)

You have to change the setting to:
export
JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home


Joerg


On 25.01.2009, at 00:16, nitesh bhatia wrote:

 Hi

My current default settings are  for java 1.6

nMac:hadoop-0.19.0 Aryan$ $JAVA_HOME/bin/java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)


The system is working fine with Hadoop 0.18.2.

--nitesh

On Sun, Jan 25, 2009 at 4:15 AM, Craig Macdonald 
wrote:

 Hi,

I guess that the java on your PATH is different from the setting of your
$JAVA_HOME env variable.
Try $JAVA_HOME/bin/java -version?

Also, there is a program called Java Preferences on each system for
changing the default java version used.

Craig


nitesh bhatia wrote:

 Hi

I am trying to setup Hadoop 0.19 on OS X. Current Java Version is

java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

When I am trying to format dfs  using "bin/hadoop dfs -format" command.
I
am
getting following errors:

nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
version number in .class file
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
version number in .class file
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)


I am not sure why this error is coming. I am having latest Java version.
Can
anyone help me out with this?

Thanks
Nitesh







--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


--











Re: Zeroconf for hadoop

2009-01-26 Thread Raghu Angadi

nitesh bhatia wrote:

Hi
Apple provides opensource discovery service called Bonjour (zeroconf). Is it
possible to integrate Zeroconf with Hadoop so that discovery of nodes become
automatic ? Presently for setting up multi-node cluster we need to add IPs
manually. Integrating it with bonjour can make this process automatic.


It would be nice to have. Note that it is the slaves (DataNodes, 
TaskTrackers) that need to do the discovery. NameNode just passively 
accepts the DataNodes that want to be part of the cluster.


In that sense, the NN should announce itself and the DNs try to find where the 
NN is. It would be nice to have a zeroconf feature in some form, and we might 
discover many more uses for it. Of course, a cluster should not require it.


Raghu.


Re: Zeroconf for hadoop

2009-01-26 Thread Edward Capriolo
Zeroconf is more focused on simplicity than security. One of the
original problems that may have been fixed is that any program can
announce any service, i.e., my laptop can announce that it is the DNS for
google.com, etc.

I want to mention a related topic to the list. People are approaching
auto-discovery in a number of ways on the JIRA. There are a few ways I
can think of to discover hadoop. A very simple way might be to publish
the configuration over a web interface. I use a network storage system
called gluster-fs. Gluster can be configured so the server holds the
configuration for each client. If the hadoop namenode held the entire
configuration for all the nodes, each node would only need to be aware
of the namenode and could retrieve its configuration from it.

Having a central configuration management or discovery system would
be very useful. HOD is what I think to be the closest thing, though it
is more of a top-down deployment system.


Re: Zeroconf for hadoop

2009-01-26 Thread nitesh bhatia
For a closed uniform system (yahoo, google), this can work best. This can
provide a plug-and-play type of system. Through this we can change clusters
into dynamic grids. But I am not sure of the outcome so far; I am still
reading the documentation.

--nitesh


On Mon, Jan 26, 2009 at 1:59 PM, Allen Wittenauer  wrote:

>
>
>
> On 1/25/09 8:45 AM, "nitesh bhatia"  wrote:
> > Apple provides opensource discovery service called Bonjour (zeroconf). Is
> it
> > possible to integrate Zeroconf with Hadoop so that discovery of nodes
> become
> > automatic ? Presently for setting up multi-node cluster we need to add
> IPs
> > manually. Integrating it with bonjour can make this process automatic.
>
> How do you deal with multiple grids?
>
> How do you deal with security?
>
>


-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: Zeroconf for hadoop

2009-01-26 Thread Raghu Angadi

Nitay wrote:

Why not use the distributed coordination service ZooKeeper? When nodes come
up they write some ephemeral file in a known ZooKeeper directory and anyone
who's interested, i.e. NameNode, can put a watch on the directory and get
notified when new children come up.


NameNode does not do active discovery. It is the DataNodes that contact the 
NameNode about their presence. So with ZooKeeper or zeroconf, DataNodes 
should be able to discover who their NN is and connect to it.


Raghu.


On Mon, Jan 26, 2009 at 10:59 AM, Allen Wittenauer  wrote:




On 1/25/09 8:45 AM, "nitesh bhatia"  wrote:

Apple provides opensource discovery service called Bonjour (zeroconf). Is

it

possible to integrate Zeroconf with Hadoop so that discovery of nodes

become

automatic ? Presently for setting up multi-node cluster we need to add

IPs

manually. Integrating it with bonjour can make this process automatic.

How do you deal with multiple grids?

How do you deal with security?








Re: Zeroconf for hadoop

2009-01-26 Thread Nitay
Why not use the distributed coordination service ZooKeeper? When nodes come
up they write some ephemeral file in a known ZooKeeper directory and anyone
who's interested, i.e. NameNode, can put a watch on the directory and get
notified when new children come up.

On Mon, Jan 26, 2009 at 10:59 AM, Allen Wittenauer  wrote:

>
>
>
> On 1/25/09 8:45 AM, "nitesh bhatia"  wrote:
> > Apple provides opensource discovery service called Bonjour (zeroconf). Is
> it
> > possible to integrate Zeroconf with Hadoop so that discovery of nodes
> become
> > automatic ? Presently for setting up multi-node cluster we need to add
> IPs
> > manually. Integrating it with bonjour can make this process automatic.
>
> How do you deal with multiple grids?
>
> How do you deal with security?
>
>
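
A rough sketch of the ephemeral-znode idea with ZooKeeper's Java client (the
connect string and the /hadoop/datanodes path are made up, and the parent
path is assumed to exist already):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class RegisterNode {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
          public void process(WatchedEvent event) { /* no-op */ }
        });
        // An ephemeral znode disappears automatically when this process's
        // session ends, so a watch on /hadoop/datanodes sees nodes come and go.
        zk.create("/hadoop/datanodes/" + args[0], new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Thread.sleep(Long.MAX_VALUE);  // registration lives as long as the session
      }
    }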


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Raghu Angadi

Mark Kerzner wrote:

Raghu,

if I write all files only once, is the cost the same in one directory, or do I
need to find the optimal directory size and, when full, start another
"bucket"?


If you write only once, then writing won't be much of an issue. You can 
write them in lexical order to help with buffer copies. These are all 
implementation details that a user should not depend on.


That said, the rest of the discussion in this thread is going in the 
right direction : to get you to use fewer files that combines a lot of 
these small files.


A large number of small files has overhead in many places in HDFS: strain 
on DataNodes, NameNode memory, etc.


Raghu.



Re: Zeroconf for hadoop

2009-01-26 Thread Allen Wittenauer



On 1/25/09 8:45 AM, "nitesh bhatia"  wrote:
> Apple provides opensource discovery service called Bonjour (zeroconf). Is it
> possible to integrate Zeroconf with Hadoop so that discovery of nodes become
> automatic ? Presently for setting up multi-node cluster we need to add IPs
> manually. Integrating it with bonjour can make this process automatic.

How do you deal with multiple grids?

How do you deal with security?



Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
We like compression if the data is readily compressible and large as it
saves on IO time.


On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner  wrote:

> Doug,
> SequenceFile looks like a perfect candidate to use in my project, but are
> you saying that I better use uncompressed data if I am not interested in
> saving disk space?
>
> Thank you,
> Mark
>
> On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:
>
> > Philip (flip) Kromer wrote:
> >
> >> Heretrix ,
> >> Nutch,
> >> others use the ARC file format
> >>  http://www.archive.org/web/researcher/ArcFileFormat.php
> >>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> >>
> >
> > Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> > store crawled pages.  The keys of crawl content files are URLs and the
> > values are:
> >
> >
> >
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
> >
> > I believe that the implementation of this class pre-dates SequenceFile's
> > support for compressed values, so the values are decompressed on demand,
> > which needlessly complicates its implementation and API.  It's basically
> a
> > Writable that stores binary content plus headers, typically an HTTP
> > response.
> >
> > Doug
> >
>


setNumTasksToExecutePerJvm and Configure

2009-01-26 Thread Saptarshi Guha

Hello,
	Suppose I set setNumTasksToExecutePerJvm to -1. Then the same JVM
may run several tasks consecutively.
1) Will the configure method (if present) be run for every task, or
only for the first task that the JVM runs?
2) Similarly, will the close method (if present) be run only for the
/last/ task run by that JVM?


Thank you
Saptarshi

Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
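
One way to answer this empirically is to turn on JVM reuse and watch the task
logs. A rough sketch with the old mapred API (the static counter is only
there to make JVM sharing visible; in the driver you would call
conf.setNumTasksToExecutePerJvm(-1)):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JvmReuseProbe extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

      // Static state survives across tasks only when they share a JVM.
      private static int tasksSeenInThisJvm = 0;

      public void configure(JobConf job) {
        tasksSeenInThisJvm++;
        System.err.println("configure() call #" + tasksSeenInThisJvm + " in this JVM");
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<LongWritable, Text> out, Reporter reporter)
          throws IOException {
        out.collect(key, value);
      }

      public void close() throws IOException {
        System.err.println("close() after task #" + tasksSeenInThisJvm);
      }
    }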



Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Doug,
SequenceFile looks like a perfect candidate to use in my project, but are
you saying that I better use uncompressed data if I am not interested in
saving disk space?

Thank you,
Mark

On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting  wrote:

> Philip (flip) Kromer wrote:
>
>> Heretrix ,
>> Nutch,
>> others use the ARC file format
>>  http://www.archive.org/web/researcher/ArcFileFormat.php
>>  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
>>
>
> Nutch does not use ARC format but rather uses Hadoop's SequenceFile to
> store crawled pages.  The keys of crawl content files are URLs and the
> values are:
>
>
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html
>
> I believe that the implementation of this class pre-dates SequenceFile's
> support for compressed values, so the values are decompressed on demand,
> which needlessly complicates its implementation and API.  It's basically a
> Writable that stores binary content plus headers, typically an HTTP
> response.
>
> Doug
>


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Philip (flip) Kromer wrote:

Heretrix ,
Nutch,
others use the ARC file format
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml


Nutch does not use ARC format but rather uses Hadoop's SequenceFile to 
store crawled pages.  The keys of crawl content files are URLs and the 
values are:


http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html

I believe that the implementation of this class pre-dates SequenceFile's 
support for compressed values, so the values are decompressed on demand, 
which needlessly complicates its implementation and API.  It's basically 
a Writable that stores binary content plus headers, typically an HTTP 
response.


Doug


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Steve Loughran

Philip (flip) Kromer wrote:

I ran into this problem, hard, and I can vouch that this is not a windows-only
problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
than a few hundred thousand files in the same directory. (The operation to
correct this mistake took a week to run.)  That is one of several hard
lessons I learned about "don't write your scraper to replicate the path
structure of each document as a file on disk."


I've seen a fair few machines (one of the network store programs) top 
out at 65K files/dir, which shows why it is good to test your assumptions 
before you go live.