Re: JNI and calling Hadoop jar files

2009-03-23 Thread Jeff Eastman
This looks somewhat similar to my Subtle Classloader Issue from 
yesterday. I'll be watching this thread too.


Jeff

Saptarshi Guha wrote:

Hello,
I'm using some JNI interfaces via R. My classpath contains all the
jar files in $HADOOP_HOME and $HADOOP_HOME/lib.
My class is
public SeqKeyList() throws Exception {

config = new  org.apache.hadoop.conf.Configuration();
config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
+"/hadoop-default.xml"));
config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
+"/hadoop-site.xml"));

System.out.println("C="+config);
filesystem = FileSystem.get(config);
System.out.println("C="+config+"F=" +filesystem);
System.out.println(filesystem.getUri().getScheme());

}

I am using a distributed filesystem
(org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl).
When run from the command line and this class is created, everything works fine.
When called using JNI I get
 java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem

Is this a JNI issue? How can it work from the command line using the
same classpath, yet throw this exception when run via JNI?
Saptarshi Guha
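
One hedged guess at the cause, with a minimal sketch (the JniHadoopBootstrap class below is invented for illustration, not taken from the thread): threads attached to the JVM from native code often have no context classloader, and Hadoop resolves classes such as the fs.hdfs.impl implementation through a classloader, so the lookup can fail under JNI even though the same classpath works from the command line. Explicitly pointing the Configuration at a known-good loader before opening the FileSystem is one thing to try:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class JniHadoopBootstrap {
  // Assumption: this class was loaded by the same classloader that can see
  // the Hadoop jars (e.g. the application classloader set up by R/JNI).
  public static FileSystem openFs() throws Exception {
    ClassLoader loader = JniHadoopBootstrap.class.getClassLoader();
    if (Thread.currentThread().getContextClassLoader() == null) {
      Thread.currentThread().setContextClassLoader(loader);
    }
    Configuration conf = new Configuration();
    conf.setClassLoader(loader); // resolve fs.hdfs.impl and friends with this loader
    return FileSystem.get(conf);
  }
}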


  




Subtle Classloader Issue

2009-03-22 Thread Jeff Eastman
I'm trying to run the Dirichlet clustering example from 
(http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html). The command 
line:


$HADOOP_HOME/bin/hadoop jar 
$MAHOUT_HOME/examples/target/mahout-examples-0.1.job 
org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job


... loads our example jar file which contains the following structure:

>jar -tf mahout-examples-0.1.job
META-INF/
...
org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
...
lib/mahout-core-0.1-tests.jar
lib/mahout-core-0.1.jar
lib/hadoop-core-0.19.1.jar
...

The dirichlet/Job first runs a map-reduce job to convert the input data 
into Mahout Vector format and then runs the DirichletDriver.runJob() 
method contained in the lib/mahout-core-0.1.jar. This method calls 
DirichletDriver.createState() which initializes a 
NormalScModelDistribution with a set of NormalScModels that represent 
the prior state of the clustering. This state is then written to HDFS 
and the job begins running the iterations which assign input data points 
to the models. So far so good.


  public static DirichletState createState(String modelFactory,
      int numModels, double alpha_0)
      throws ClassNotFoundException, InstantiationException, IllegalAccessException {

    ClassLoader ccl = Thread.currentThread().getContextClassLoader();
    Class cl = ccl.loadClass(modelFactory);
    ModelDistribution factory = (ModelDistribution) cl.newInstance();
    DirichletState state = new DirichletState(factory, numModels, alpha_0, 1, 1);

    return state;
  }


In the DirichletMapper, also in the lib/mahout jar, the configure() 
method reads in the current model state by calling 
DirichletDriver.createState(). In this invocation, however, it throws a 
CNF exception.


09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_00_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)


The kMeans job, which uses the same class loader code to load its 
distance measure in similar driver code, works fine. The difference is 
that the referenced distance measure is contained in the 
mahout-core-0.1.jar, not the mahout-examples-0.1.job. Both jobs run fine 
in test mode from Eclipse.


It would seem that there is some subtle difference in the class loader 
structures used by the DirichletDriver and DirichletMapper process 
invocations. In the former, the driver code is called by code living in 
the example jar; in the latter, the driver code is called by code living 
in the mahout jar. It's as if the first case can see into the lib/mahout 
classes but the second cannot see out to the classes in the example jar.


Can anybody clarify what is going on and how to fix it?

Jeff
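
One direction to try, shown only as a hedged sketch rather than a confirmed fix: resolve the model factory through the classloader attached to the JobConf instead of the thread context classloader. The extra JobConf parameter is an assumption and not the existing Mahout signature; the other names are reused from the snippet above.

  public static DirichletState createState(JobConf conf, String modelFactory,
      int numModels, double alpha_0)
      throws ClassNotFoundException, InstantiationException, IllegalAccessException {
    // getClassByName() uses the configuration's classloader, which in a task
    // is the one that unpacked the job jar and its lib/ directory.
    Class cl = conf.getClassByName(modelFactory);
    ModelDistribution factory = (ModelDistribution) cl.newInstance();
    return new DirichletState(factory, numModels, alpha_0, 1, 1);
  }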



Re: RecordReader design heuristic

2009-03-18 Thread Jeff Eastman

Hi Josh,
It seemed like you had a conceptual wire crossed and I'm glad to help 
out. The neat thing about Hadoop mappers is - since they are given a 
replicated HDFS block to munch on - the job scheduler has <replication 
factor> node choices for where it can run each mapper. This means 
mappers can almost always read from local storage.


On another note, I notice you are processing what looks to be large 
quantities of vector data. If you have any interest in clustering this 
data you might want to look at the Mahout project 
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready 
clustering algorithms, including a new non-parametric Dirichlet Process 
Clustering implementation that I committed recently. We are pulling it 
all together for a 0.1 release and I would be very interested in helping 
you to apply these algorithms if you have an interest.


Jeff


Patterson, Josh wrote:

Jeff,
ok, that makes more sense. I was under the mis-impression that it was creating 
and destroying mappers for each input record. I don't know why I had that in my 
head. My design suddenly became a lot clearer, and this provides a much cleaner 
abstraction. Thanks for your help!

Josh Patterson
TVA

  




Re: RecordReader design heuristic

2009-03-17 Thread Jeff Eastman

Hi Josh,

Well, I don't really see how you will get more mappers, just simpler 
logic in the mapper. The number of mappers is driven by how many input 
files you have and their sizes and not by any chunking you do in the 
record reader. Each record reader will get an entire split and will feed 
it to its mapper in a stream one record at a time. You can duplicate 
some of that logic in the mapper if you want but you already will have 
it in the reader so why bother?


Jeff
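
For what it's worth, here is a minimal sketch of a reader for fixed-width records like Josh's 10-byte points, using the old org.apache.hadoop.mapred API. Everything in it (class name, record length, the assumption that splits are aligned to record boundaries) is illustrative rather than taken from his format; the point is only that the reader owns the loop over the split and the mapper sees one record per map() call.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class FixedPointRecordReader implements RecordReader<LongWritable, BytesWritable> {
  private static final int RECORD_LEN = 10;   // assumed point size
  private final FSDataInputStream in;
  private final long start;
  private final long end;
  private long pos;

  public FixedPointRecordReader(FileSplit split, JobConf job) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(job);
    in = fs.open(split.getPath());
    start = split.getStart();
    end = start + split.getLength();
    pos = start;
    in.seek(start);
    // Assumes splits are a multiple of RECORD_LEN; a matching InputFormat
    // would have to guarantee that, or the reader would need to realign.
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos + RECORD_LEN > end) {
      return false;                           // stop at the split boundary
    }
    byte[] buf = new byte[RECORD_LEN];
    in.readFully(buf);
    key.set(pos);                             // key = byte offset of the record
    value.set(buf, 0, RECORD_LEN);
    pos += RECORD_LEN;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return end == start ? 0.0f : (pos - start) / (float) (end - start);
  }
}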


Patterson, Josh wrote:

Jeff,
So if I'm hearing you right, it's "good" to send one point of data (10
bytes here) to a single mapper? This mindset increases the number of
mappers, but keeps their logic scaled down to simply "look at this
record and emit/don't emit" --- which is considered more favorable? I'm
still getting the hang of the MR design tradeoffs, thanks for your
feedback.

Josh Patterson
TVA

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Tuesday, March 17, 2009 5:11 PM

To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

If you send a single point to the mapper, your mapper logic will be 
clean and simple. Otherwise you will need to loop over your block of 
points in the mapper. In Mahout clustering, I send the mapper individual 
points because the input file is point-per-line. In either case, the 
record reader will be iterating over a block of data to provide mapper 
inputs. IIRC, splits will generally be an HDFS block or less, so if you 
have files smaller than that you will get one mapper per file. For larger 
files you can get up to one mapper per split block.


Jeff

Patterson, Josh wrote:
  

I am currently working on a RecordReader to read a custom time series
data binary file format and was wondering about ways to be most
efficient in designing the InputFormat/RecordReader process. Reading
through:
 
http://wiki.apache.org/hadoop/HadoopMapReduce
 
gave me a lot of hints about how the various classes work together in
order to read any type of file. I was looking at how the TextInputFormat
uses the LineRecordReader in order to send individual lines to each
mapper. My question is, what is a good heuristic in how to choose how
much data to send to each mapper? With the stock LineRecordReader each
mapper only gets to work with a single line which leads me to believe
that we want to give each mapper very little work. Currently I'm looking
at either sending each mapper a single point of data (10 bytes), which
seems small, or sending a single mapper a block of data (around 819
points, at 10 bytes each, ---> 8190 bytes). I'm leaning towards sending
the block to the mapper.
 
These factors are based around dealing with a legacy file format (for
now) so I'm just trying to make the best tradeoff possible for the short
term until I get some basic stuff rolling, at which point I can suggest
a better storage format, or just start converting the groups of stored
points into a format more fitting for the platform. I understand that
the InputFormat is not really trying to make much meaning out of the
data, other than to help assist in getting the correct data out of the
file based on the file split variables. Another question I have is, with
a pretty much stock install, generally how big is each FileSplit?
 
Josh Patterson

TVA






  




Re: RecordReader design heuristic

2009-03-17 Thread Jeff Eastman
If you send a single point to the mapper, your mapper logic will be 
clean and simple. Otherwise you will need to loop over your block of 
points in the mapper. In Mahout clustering, I send the mapper individual 
points because the input file is point-per-line. In either case, the 
record reader will be iterating over a block of data to provide mapper 
inputs. IIRC, splits will generally be an HDFS block or less, so if you 
have files smaller than that you will get one mapper per file. For larger 
files you can get up to one mapper per split block.


Jeff

Patterson, Josh wrote:

I am currently working on a RecordReader to read a custom time series
data binary file format and was wondering about ways to be most
efficient in designing the InputFormat/RecordReader process. Reading
through:
 
http://wiki.apache.org/hadoop/HadoopMapReduce
 
 
gave me a lot of hints about how the various classes work together in

order to read any type of file. I was looking at how the TextInputFormat
uses the LineRecordReader in order to send individual lines to each
mapper. My question is, what is a good heuristic in how to choose how
much data to send to each mapper? With the stock LineRecordReader each
mapper only gets to work with a single line which leads me to believe
that we want to give each mapper very little work. Currently I'm looking
at either sending each mapper a single point of data (10 bytes), which
seems small, or sending a single mapper a block of data (around 819
points, at 10 bytes each, ---> 8190 bytes). I'm leaning towards sending
the block to the mapper.
 
These factors are based around dealing with a legacy file format (for

now) so I'm just trying to make the best tradeoff possible for the short
term until I get some basic stuff rolling, at which point I can suggest
a better storage format, or just start converting the groups of stored
points into a format more fitting for the platform. I understand that
the InputFormat is not really trying to make much meaning out of the
data, other than to help assist in getting the correct data out of the
file based on the file split variables. Another question I have is, with
a pretty much stock install, generally how big is each FileSplit?
 
Josh Patterson

TVA

  




Re: Users Group Meeting Slides

2008-05-22 Thread Jeff Eastman

Hi Tanton,

We have canopy and kmeans in trunk that you are welcome to use 
(https://svn.apache.org/repos/asf/lucene/mahout/trunk). Please post any 
usability questions you may have or suggestions for improvements to the 
Mahout user list ([EMAIL PROTECTED]). There is some 
documentation on the wiki 
(http://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering) 
and you can inspect the unit tests for more information. I'm in the 
process of building some examples and those will get committed to trunk 
in a week or so.


Smooth sailing,
Jeff

Tanton Gibbs wrote:

I checked out the Wiki.  I am in need of a canopy clustering algorithm
for hadoop.   I'm about to embark on writing one, but if you have one
already, that would be better.  It doesn't have to be perfect, I can
improve on it as I go.  However, I couldn't find the svn repository
for Mahout...any pointers to where I can find the code?

Thanks!
Tanton

On Thu, May 22, 2008 at 11:36 AM, Jeff Eastman
<[EMAIL PROTECTED]> wrote:
  

I uploaded the slides from my Mahout overview to our wiki
(http://cwiki.apache.org/confluence/display/MAHOUT/FAQ) along with another
recent talk by Isabel Drost. Both are similar in content but their
differences reflect the rapid evolution of the project in the month that
separates them in time. After I got home I worried a bit that I had skipped
over too much of the material in an effort to keep it brief.

Thanks to Yahoo! for hosting the meeting and providing beer and pizza for
the 50 or so friends of Hadoop who attended. I thought the meeting was fun
and informative and a great way to begin to associate faces and richer
personalities with the names that fly through my in box. For those of you
who were not able to attend, the meeting was quite informal and we got
brief, mostly extemporaneous, updates of some of the projects that were on
the agenda at the recent Hadoop Summit: Hadoop 0.17, HBase, Pig, Zookeeper,
Mahout, ...

In the wrapup, people seemed to like this agenda and I think Ajay plans to
continue with the same general format in subsequent monthly meetings. Due to
the informality, I did not take notes. Perhaps the other presenters can post
summaries of their updates for the wider community's appreciation.

Jeff





  




Users Group Meeting Slides

2008-05-22 Thread Jeff Eastman
I uploaded the slides from my Mahout overview to our wiki 
(http://cwiki.apache.org/confluence/display/MAHOUT/FAQ) along with 
another recent talk by Isabel Drost. Both are similar in content but 
their differences reflect the rapid evolution of the project in the 
month that separates them in time. After I got home I worried a bit that 
I had skipped over too much of the material in an effort to keep it brief.


Thanks to Yahoo! for hosting the meeting and providing beer and pizza 
for the 50 or so friends of Hadoop who attended. I thought the meeting 
was fun and informative and a great way to begin to associate faces and 
richer personalities with the names that fly through my in box. For 
those of you who were not able to attend, the meeting was quite informal 
and we got brief, mostly extemporaneous, updates of some of the projects 
that were on the agenda at the recent Hadoop Summit: Hadoop 0.17, HBase, 
Pig, Zookeeper, Mahout, ...


In the wrapup, people seemed to like this agenda and I think Ajay plans 
to continue with the same general format in subsequent monthly meetings. 
Due to the informality, I did not take notes. Perhaps the other 
presenters can post summaries of their updates for the wider community's 
appreciation.


Jeff


Re: Hadoop 0.17 AMI?

2008-05-22 Thread Jeff Eastman

Thanks Tom, I'll try them out today
Jeff


Tom White wrote:

Hi Jeff,

I've built two public 0.17.0 AMIs (32-bit and 64-bit), so you should
be able to use the 0.17 scripts to launch them now.

Cheers,
Tom

On Thu, May 22, 2008 at 6:37 AM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
  

Hi Jeff,

0.17.0 was released yesterday, from what I can tell.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message ----
    

From: Jeff Eastman <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, May 21, 2008 11:18:56 AM
Subject: Re: Hadoop 0.17 AMI?

Any word on 0.17? I was able to build an AMI from a trunk checkout and
deploy a single node cluster but the create-hadoop-image-remote script
really wants a tarball in the archive. I'd rather not waste time munging
the scripts if a release is near.

Jeff

Nigel Daley wrote:
  

Hadoop 0.17 hasn't been released yet.  I (or Mukund) is hoping to call
a vote this afternoon or tomorrow.

Nige

On May 14, 2008, at 12:36 PM, Jeff Eastman wrote:


I'm trying to bring up a cluster on EC2 using
(http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
version to use because of the DNS improvements, etc. Unfortunately, I
cannot find a public AMI with this build. Is there one that I'm not
finding or do I need to create one?

Jeff

  







  




Re: Hadoop experts wanted

2008-05-21 Thread Jeff Eastman

Hi Edward,

Check out this link 
(http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable) 
before you panic over the similar postings. Jim's a little vague about 
what he's actually going to do with this data or when, but I found it 
useful.


Jeff


Edward J. Yoon wrote:

Hey Akshar!
Just FYI, See http://www.nabble.com/Django-experts-wanted-td17322054.html

-Edward

On Thu, May 22, 2008 at 6:24 AM, Akshar <[EMAIL PROTECTED]> wrote:
  

Interesting!!

BTW, Where do you work?

On Thu, May 15, 2008 at 2:23 PM, Jim R. Wilson <[EMAIL PROTECTED]>
wrote:



Hi all,

Hadoop is a great project and a growing niche.  As it becomes even
more popular, there will be increasing demand for experts in the
field.

I am compiling a contact list of Hadoop experts who may be interested
in opportunities under the right circumstances.  I am not a recruiter
- I'm a regular developer who sometimes gets asked for referrals when
I'm not personally available.

If you'd like to be on my shortlist of go-to experts, please contact
me off-list at: [EMAIL PROTECTED]

Please be prepared to show your expertise by any of the following:
 * Committer status or patches accepted
 * Commit access to another open source project which uses Hadoop
 * Bugs reported which were either resolved or are still open (real bugs)
 * Articles / blog entries written about Hadoop concepts or development
 * Speaking engagements or user groups at which you've presented
 * Significant contributions to documentation
 * Other? (I'm sure I didn't think of everything)

I'll be happy to answer any questions, and I look forward to hearing from
you!

-- Jim R. Wilson (jimbojw)

  




  




Re: Hadoop 0.17 AMI?

2008-05-21 Thread Jeff Eastman
Any word on 0.17? I was able to build an AMI from a trunk checkout and 
deploy a single node cluster but the create-hadoop-image-remote script 
really wants a tarball in the archive. I'd rather not waste time munging 
the scripts if a release is near.


Jeff

Nigel Daley wrote:
Hadoop 0.17 hasn't been released yet.  I (or Mukund) is hoping to call 
a vote this afternoon or tomorrow.


Nige

On May 14, 2008, at 12:36 PM, Jeff Eastman wrote:

I'm trying to bring up a cluster on EC2 using
(http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
version to use because of the DNS improvements, etc. Unfortunately, I
cannot find a public AMI with this build. Is there one that I'm not
finding or do I need to create one?

Jeff









Hadoop 0.17 AMI?

2008-05-14 Thread Jeff Eastman

I'm trying to bring up a cluster on EC2 using
(http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
version to use because of the DNS improvements, etc. Unfortunately, I
cannot find a public AMI with this build. Is there one that I'm not
finding or do I need to create one?

Jeff



RE: Hadoop input path - can it have subdirectories

2008-04-01 Thread Jeff Eastman
My experience running with the Java API is that subdirectories in the input
path do cause an exception, so the streaming file input processing must be
different. 

Jeff Eastman
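
A hedged sketch of one workaround on the Java side: walk the directory tree yourself and add only the files as input paths. The helper below is illustrative and assumes a release new enough to have FileSystem.listStatus() and the static FileInputFormat.addInputPath() (older releases used listPaths() and JobConf.addInputPath() instead).

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RecursiveInputs {
  // Recursively add every plain file under root as an input path, since the
  // Java API of this era throws when a subdirectory shows up in the input.
  public static void addInputsRecursively(JobConf job, Path root) throws IOException {
    FileSystem fs = root.getFileSystem(job);
    for (FileStatus stat : fs.listStatus(root)) {
      if (stat.isDir()) {
        addInputsRecursively(job, stat.getPath());  // descend into subdirectory
      } else {
        FileInputFormat.addInputPath(job, stat.getPath());
      }
    }
  }
}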
 

> -Original Message-
> From: Norbert Burger [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 01, 2008 9:46 AM
> To: core-user@hadoop.apache.org
> Cc: [EMAIL PROTECTED]
> Subject: Re: Hadoop input path - can it have subdirectories
> 
> Yes, this is fine, at least for Hadoop Streaming.  I specify the root of
> my
> logs directory as my -input parameter, and Hadoop correctly finds all of
> child directories.  What's the error you're seeing?  Is a stack trace
> available?
> 
> Norbert
> 
> On Tue, Apr 1, 2008 at 12:15 PM, Tarandeep Singh <[EMAIL PROTECTED]>
> wrote:
> 
> > Hi,
> >
> > Can I give a directory (having subdirectories) as input path to Hadoop
> > Map-Reduce Job.
> > I tried, but got error.
> >
> > Can Hadoop recursively traverse the input directory and collect all
> > the file names or the input path has to be just a directory containing
> > files (and no sub-directories) ?
> >
> > -Taran
> >




RE: Hadoop summit video capture?

2008-03-25 Thread Jeff Eastman
I don't know if there was a live version, but the entire summit was recorded
on video so it will be available. BTW, it was an overwhelming success and
the speakers are all well worth waiting for. I personally got a lot of
positive feedback and interest in Mahout, so expect your inbox to explode in
the next couple of days.

Jeff

> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 25, 2008 8:03 PM
> To: core-user@hadoop.apache.org
> Subject: Hadoop summit video capture?
> 
> Hi,
> 
> Wasn't there going to be a live stream from the Hadoop summit?  I couldn't
> find any references on the event site/page, and searches on veoh, youtube
> and google video yielded nothing.
> 
> Is an archived version of the video (going to be) available?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 





RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
I wouldn't call it a design feature so much as a consequence of background
processing in the NameNode to clean up the recently-closed files and reclaim
their blocks.

Jeff

> -Original Message-
> From: André Martin [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 2:48 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Performance / cluster scaling question
> 
> Right, I totally forgot about the replication factor... However
> sometimes I even noticed ratios of 5:1 for block numbers to files...
> Is the delay for block deletion/reclaiming an intended behavior?
> 
> Jeff Eastman wrote:
> > That makes the math come out a lot closer (3*423763=1271289). I've also
> > noticed there is some delay in reclaiming unused blocks so what you are
> > seeing in terms of block allocations do not surprise me.
> >
> >
> >> -Original Message-
> >> From: André Martin [mailto:[EMAIL PROTECTED]
> >> Sent: Friday, March 21, 2008 2:36 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: Performance / cluster scaling question
> >>
> >> 3 - the default one...
> >>
> >> Jeff Eastman wrote:
> >>
> >>> What's your replication factor?
> >>> Jeff
> >>>
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: André Martin [mailto:[EMAIL PROTECTED]
> >>>> Sent: Friday, March 21, 2008 2:25 PM
> >>>> To: core-user@hadoop.apache.org
> >>>> Subject: Performance / cluster scaling question
> >>>>
> >>>> Hi everyone,
> >>>> I ran a distributed system that consists of 50 spiders/crawlers and 8
> >>>> server nodes with a Hadoop DFS cluster with 8 datanodes and a
> >>>>
> >> namenode...
> >>
> >>>> Each spider has 5 job processing / data crawling threads and puts
> >>>> crawled data as one complete file onto the DFS - additionally there
> are
> >>>> splits created for each server node that are put as files onto the
> DFS
> >>>> as well. So basically there are 50*5*9 = ~2250 concurrent writes
> across
> >>>> 8 datanodes.
> >>>> The splits are read by the server nodes and will be deleted
> afterwards,
> >>>> so those (split)-files exists for only a few seconds to minutes...
> >>>> Since 99% of the files have a size of less than 64 MB (the default
> >>>>
> >> block
> >>
> >>>> size) I expected that the number of files is roughly equal to the
> >>>>
> >> number
> >>
> >>>> of blocks. After running the system for 24hours the namenode WebUI
> >>>>
> >> shows
> >>
> >>>> 423763 files and directories and 1480735 blocks. It looks like that
> the
> >>>> system does not catch up with deleting all the invalidated blocks -
> my
> >>>> assumption?!?
> >>>> Also, I noticed that the overall performance of the cluster goes down
> >>>> (see attached image).
> >>>> There are a bunch of Could not get block locations. Aborting...
> >>>> exceptions and those exceptions seem to appear more frequently
> towards
> >>>> the end of the experiment.
> >>>>
> >>>>
> >>>>> java.io.IOException: Could not get block locations. Aborting...
> >>>>> at
> >>>>>
> >>>>>
> >>>>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSCl
> >>
> >>>> ient.java:1824)
> >>>>
> >>>>
> >>>>> at
> >>>>>
> >>>>>
> >>>>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java
> >>
> >>>> :1479)
> >>>>
> >>>>
> >>>>> at
> >>>>>
> >>>>>
> >>>>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient
> >>
> >>>> .java:1571)
> >>>> So, is the cluster simply saturated with the such a frequent creation
> >>>> and deletion of files, or is the network that actual bottleneck? The
> >>>> work load does not change at all during the whole experiment.
> >>>> On cluster side I see lots of the following exceptions:
> >>>

RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
That makes the math come out a lot closer (3*423763=1271289). I've also
noticed there is some delay in reclaiming unused blocks so what you are
seeing in terms of block allocations do not surprise me.

> -Original Message-
> From: André Martin [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 2:36 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Performance / cluster scaling question
> 
> 3 - the default one...
> 
> Jeff Eastman wrote:
> > What's your replication factor?
> > Jeff
> >
> >
> >> -Original Message-
> >> From: André Martin [mailto:[EMAIL PROTECTED]
> >> Sent: Friday, March 21, 2008 2:25 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Performance / cluster scaling question
> >>
> >> Hi everyone,
> >> I ran a distributed system that consists of 50 spiders/crawlers and 8
> >> server nodes with a Hadoop DFS cluster with 8 datanodes and a
> namenode...
> >> Each spider has 5 job processing / data crawling threads and puts
> >> crawled data as one complete file onto the DFS - additionally there are
> >> splits created for each server node that are put as files onto the DFS
> >> as well. So basically there are 50*5*9 = ~2250 concurrent writes across
> >> 8 datanodes.
> >> The splits are read by the server nodes and will be deleted afterwards,
> >> so those (split)-files exists for only a few seconds to minutes...
> >> Since 99% of the files have a size of less than 64 MB (the default
> block
> >> size) I expected that the number of files is roughly equal to the
> number
> >> of blocks. After running the system for 24hours the namenode WebUI
> shows
> >> 423763 files and directories and 1480735 blocks. It looks like that the
> >> system does not catch up with deleting all the invalidated blocks - my
> >> assumption?!?
> >> Also, I noticed that the overall performance of the cluster goes down
> >> (see attached image).
> >> There are a bunch of Could not get block locations. Aborting...
> >> exceptions and those exceptions seem to appear more frequently towards
> >> the end of the experiment.
> >>
> >>> java.io.IOException: Could not get block locations. Aborting...
> >>> at
> >>>
> >>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSCl
> >> ient.java:1824)
> >>
> >>> at
> >>>
> >>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java
> >> :1479)
> >>
> >>> at
> >>>
> >>>
> >>
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient
> >> .java:1571)
> >> So, is the cluster simply saturated with the such a frequent creation
> >> and deletion of files, or is the network that actual bottleneck? The
> >> work load does not change at all during the whole experiment.
> >> On cluster side I see lots of the following exceptions:
> >>
> >>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
> >>> PacketResponder 1 for block blk_6757062148746339382 terminating
> >>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
> >>> writeBlock blk_6757062148746339382 received exception
> >>>
> >> java.io.EOFException
> >>
> >>> 2008-03-21 20:28:05,411 ERROR org.apache.hadoop.dfs.DataNode:
> >>> 141.xxx..xxx.xxx:50010:DataXceiver: java.io.EOFException
> >>> at java.io.DataInputStream.readInt(Unknown Source)
> >>> at
> >>>
> >>>
> >>
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:22
> >> 63)
> >>
> >>> at
> >>>
> >>>
> >>
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
> >>
> >>> at
> org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
> >>> at java.lang.Thread.run(Unknown Source)
> >>> 2008-03-21 19:26:46,535 INFO org.apache.hadoop.dfs.DataNode:
> >>> writeBlock blk_-7369396710977076579 received exception
> >>> java.net.SocketException: Connection reset
> >>> 2008-03-21 19:26:46,535 ERROR org.apache.hadoop.dfs.DataNode:
> >>> 141.xxx.xxx.xxx:50010:DataXceiver: java.net.SocketException:
> >>> Connection reset
> >>> at java.net.SocketInputStream.read(Unknown Source)
>

RE: Performance / cluster scaling question

2008-03-21 Thread Jeff Eastman
What's your replication factor? 
Jeff

> -Original Message-
> From: André Martin [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 2:25 PM
> To: core-user@hadoop.apache.org
> Subject: Performance / cluster scaling question
> 
> Hi everyone,
> I ran a distributed system that consists of 50 spiders/crawlers and 8
> server nodes with a Hadoop DFS cluster with 8 datanodes and a namenode...
> Each spider has 5 job processing / data crawling threads and puts
> crawled data as one complete file onto the DFS - additionally there are
> splits created for each server node that are put as files onto the DFS
> as well. So basically there are 50*5*9 = ~2250 concurrent writes across
> 8 datanodes.
> The splits are read by the server nodes and will be deleted afterwards,
> so those (split)-files exists for only a few seconds to minutes...
> Since 99% of the files have a size of less than 64 MB (the default block
> size) I expected that the number of files is roughly equal to the number
> of blocks. After running the system for 24hours the namenode WebUI shows
> 423763 files and directories and 1480735 blocks. It looks like that the
> system does not catch up with deleting all the invalidated blocks - my
> assumption?!?
> Also, I noticed that the overall performance of the cluster goes down
> (see attached image).
> There are a bunch of Could not get block locations. Aborting...
> exceptions and those exceptions seem to appear more frequently towards
> the end of the experiment.
> > java.io.IOException: Could not get block locations. Aborting...
> > at
> >
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSCl
> ient.java:1824)
> > at
> >
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java
> :1479)
> > at
> >
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient
> .java:1571)
> So, is the cluster simply saturated with the such a frequent creation
> and deletion of files, or is the network that actual bottleneck? The
> work load does not change at all during the whole experiment.
> On cluster side I see lots of the following exceptions:
> > 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 1 for block blk_6757062148746339382 terminating
> > 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_6757062148746339382 received exception
> java.io.EOFException
> > 2008-03-21 20:28:05,411 ERROR org.apache.hadoop.dfs.DataNode:
> > 141.xxx..xxx.xxx:50010:DataXceiver: java.io.EOFException
> > at java.io.DataInputStream.readInt(Unknown Source)
> > at
> >
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:22
> 63)
> > at
> >
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
> > at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
> > at java.lang.Thread.run(Unknown Source)
> > 2008-03-21 19:26:46,535 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_-7369396710977076579 received exception
> > java.net.SocketException: Connection reset
> > 2008-03-21 19:26:46,535 ERROR org.apache.hadoop.dfs.DataNode:
> > 141.xxx.xxx.xxx:50010:DataXceiver: java.net.SocketException:
> > Connection reset
> > at java.net.SocketInputStream.read(Unknown Source)
> > at java.io.BufferedInputStream.fill(Unknown Source)
> > at java.io.BufferedInputStream.read(Unknown Source)
> > at java.io.DataInputStream.readInt(Unknown Source)
> > at
> >
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:22
> 63)
> > at
> >
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
> > at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
> > at java.lang.Thread.run(Unknown Source)
> I'm running Hadoop 0.16.1 - Has anyone made the same or a similar
> experience.
> How can the performance degradation be avoided? More datanodes? Why
> seems the block deletion not to catch up with the deletion of the file?
> Thanks in advance for your insights, ideas & suggestions :-)
> 
> Cu on the 'net,
> Bye - bye,
> 
>< André   èrbnA >





RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
I don't know the deep answer, but formatting your dfs creates a new
namespaceID that needs to be consistent across all slaves. Any data
directories containing old version IDs will prevent the DataNode from
starting on that node. Maybe somebody who really knows the machinery can
elaborate on this.

Glad you are flying now,
Jeff

> -Original Message-
> From: Colin Freas [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 1:51 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Master as DataNode
> 
> yup, got it working with that technique.
> 
> pushed it out to 5 machines, things look good.  appreciate the help.
> 
> what is it that causes this?  i know i formatted the dfs more than once.
> is
> that what does it?  or just adding nodes, or...  ?
> 
> -colin
> 
> 
> On Fri, Mar 21, 2008 at 2:30 PM, Jeff Eastman <[EMAIL PROTECTED]>
> wrote:
> 
> > I encountered this while I was starting out too, while moving from a
> > single
> > node cluster to more nodes. I suggest clearing your hadoop-datastore
> > directory, reformatting the HDFS and restarting again. You are very
> close
> > :)
> > Jeff
> >
> > > -Original Message-
> > > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, March 21, 2008 11:18 AM
> > > To: core-user@hadoop.apache.org
> > > Subject: Re: Master as DataNode
> > >
> > > ah:
> > >
> > > 2008-03-21 14:06:05,526 ERROR org.apache.hadoop.dfs.DataNode:
> > > java.io.IOException: Incompatible namespaceIDs in
> > > /var/tmp/hadoop-datastore/hadoop/dfs/data: namenode namespaceID =
> > > 2121666262; datanode namespaceID = 2058961420
> > >
> > >
> > > looks like i'm hitting this "Incompatible namespaceID" bug:
> > > http://issues.apache.org/jira/browse/HADOOP-1212
> > >
> > > is there a work around for this?
> > >
> > > -colin
> > >
> > >
> > > On Fri, Mar 21, 2008 at 1:50 PM, Jeff Eastman <
> > [EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Check your logs. That should work out of the box with the
> > configuration
> > > > steps you described.
> > > >
> > > > Jeff
> > > >
> > > > > -Original Message-
> > > > > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > > > > Sent: Friday, March 21, 2008 10:40 AM
> > > > > To: core-user@hadoop.apache.org
> > > > > Subject: Master as DataNode
> > > > >
> > > > > setting up a simple hadoop cluster with two machines, i've gotten
> to
> > > the
> > > > > point where the two machines can see each other, things seem fine,
> > but
> > > > i'm
> > > > > trying to set up the master as both a master and a slave, just for
> > > > testing
> > > > > purposes.
> > > > >
> > > > > so, i've put the master into the conf/masters file and the
> > conf/slaves
> > > > > file.
> > > > >
> > > > > things seem to work, but there's no DataNode process listed with
> jps
> > > on
> > > > > the
> > > > > master.  i'm wondering if there's a switch i need to flip to tell
> > > hadoop
> > > > > to
> > > > > use the master as a datanode even if it's in the slaves file?
> > > > >
> > > > > thanks again.
> > > > >
> > > > > -colin
> > > >
> > > >
> > > >
> >
> >
> >




RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
I encountered this while I was starting out too, while moving from a single
node cluster to more nodes. I suggest clearing your hadoop-datastore
directory, reformatting the HDFS and restarting again. You are very close :)
Jeff

> -Original Message-
> From: Colin Freas [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 11:18 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Master as DataNode
> 
> ah:
> 
> 2008-03-21 14:06:05,526 ERROR org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Incompatible namespaceIDs in
> /var/tmp/hadoop-datastore/hadoop/dfs/data: namenode namespaceID =
> 2121666262; datanode namespaceID = 2058961420
> 
> 
> looks like i'm hitting this "Incompatible namespaceID" bug:
> http://issues.apache.org/jira/browse/HADOOP-1212
> 
> is there a work around for this?
> 
> -colin
> 
> 
> On Fri, Mar 21, 2008 at 1:50 PM, Jeff Eastman <[EMAIL PROTECTED]>
> wrote:
> 
> > Check your logs. That should work out of the box with the configuration
> > steps you described.
> >
> > Jeff
> >
> > > -Original Message-
> > > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, March 21, 2008 10:40 AM
> > > To: core-user@hadoop.apache.org
> > > Subject: Master as DataNode
> > >
> > > setting up a simple hadoop cluster with two machines, i've gotten to
> the
> > > point where the two machines can see each other, things seem fine, but
> > i'm
> > > trying to set up the master as both a master and a slave, just for
> > testing
> > > purposes.
> > >
> > > so, i've put the master into the conf/masters file and the conf/slaves
> > > file.
> > >
> > > things seem to work, but there's no DataNode process listed with jps
> on
> > > the
> > > master.  i'm wondering if there's a switch i need to flip to tell
> hadoop
> > > to
> > > use the master as a datanode even if it's in the slaves file?
> > >
> > > thanks again.
> > >
> > > -colin
> >
> >
> >




RE: Master as DataNode

2008-03-21 Thread Jeff Eastman
Check your logs. That should work out of the box with the configuration
steps you described. 

Jeff

> -Original Message-
> From: Colin Freas [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 10:40 AM
> To: core-user@hadoop.apache.org
> Subject: Master as DataNode
> 
> setting up a simple hadoop cluster with two machines, i've gotten to the
> point where the two machines can see each other, things seem fine, but i'm
> trying to set up the master as both a master and a slave, just for testing
> purposes.
> 
> so, i've put the master into the conf/masters file and the conf/slaves
> file.
> 
> things seem to work, but there's no DataNode process listed with jps on
> the
> master.  i'm wondering if there's a switch i need to flip to tell hadoop
> to
> use the master as a datanode even if it's in the slaves file?
> 
> thanks again.
> 
> -colin




RE: why the value of attribute in map function will change ?

2008-03-16 Thread Jeff Eastman
Consider that your mapper and driver execute in different JVMs and cannot
share static values.

Jeff
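
A hedged sketch of the usual way to get a value computed in map tasks back to the driver across that JVM boundary: counters. The enum and type parameters below are illustrative (the generics in the quoted code were lost by the archive), not the original poster's code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TestMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public enum Stats { VALUE }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Report through the framework instead of a static field; the framework
    // aggregates this counter across all map tasks.
    reporter.incrCounter(Stats.VALUE, 1);
  }
}

// In the driver, after the job completes:
//   RunningJob job = JobClient.runJob(conf);
//   long total = job.getCounters().getCounter(TestMap.Stats.VALUE);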

> -Original Message-
> From: ma qiang [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 15, 2008 10:35 PM
> To: core-user@hadoop.apache.org
> Subject: why the value of attribute in map function will change ?
> 
> Hi all:
> I have this map class as below;
> public class TestMap extends MapReduceBase implements Mapper
> {
>   private static int value;
> 
>   public TestMap()
>   {
>value=100;
>   }
> 
>   public void map..
> 
>  
> }
> 
>  and in my Driver Class as below:
>public class Driver {
> 
>   public void main() throws Exception {
> conf...
> 
>   client.setConf(conf);
>   try {
>   JobClient.runJob(conf);
>   } catch (Exception e) {
>   e.printStackTrace();
>   }
> 
> System.out.println(TestMap.value); //here the
> value printed is 0;
>   }
> 
> I don't know why the TestMap.value printed is not 100 but 0, who can
> tell me why ?
> Thanks very much!




RE: Map/Reduce Type Mismatch error

2008-03-07 Thread Jeff Eastman
The key provided by the default FileInputFormat is not Text, but an
integer offset into the split (which is not very useful IMHO). Try
changing your mapper back to <LongWritable, Text, ...>. If you are
expecting the file name to be the key, you will (I think) need to write
your own InputFormat.

Jeff
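
Since the archived message lost its angle-bracket type parameters, here is a hedged reconstruction of the suggestion (the exact types and the class name are assumptions based on the fragments above): TextInputFormat hands the mapper a LongWritable byte offset as the key, and because the map output value (IntWritable) differs from the final output value (Text), the job also needs setMapOutputValueClass.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TypeMatchedMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // key is the byte offset into the split; the line itself arrives as value
    output.collect(value, new IntWritable(1));
  }
}

// In the driver:
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(Text.class);
//   conf.setMapOutputValueClass(IntWritable.class);  // map output differs from reduce output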

-Original Message-
From: Prasan Ary [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 07, 2008 3:50 PM
To: hadoop
Subject: Map/Reduce Type Mismatch error

  Hi All,
  I am running a Map/Reduce on a textfile.
  Map takes  as (key,value) input pair , and outputs
 as (key,value) output pair.
   
  Reduce takes  as (key,value) input pair, and outputs
 as (key,value) output pair.
   
  I am getting a type mismatch error.
   
  Any suggestion?
   
   
  JobConf job = new JobConf(..
   
  job.setOutputKeyClass(Text.class); 
  job.setOutputValueClass(Text.class); 
   
  -
  public static class Map extends MapReduceBase implements Mapper {
..
public void map(Text key, Text value, OutputCollector output, Reporter reporter) throws IOException { ..
   
  output.collect(key,new IntWritable(1));
   
  
   
  public static class Reduce extends MapReduceBase implements
Reducer { 
  public void reduce(Text key, Iterator values,
OutputCollector output, Reporter reporter) throws IOException
{  ..
   
  output.collect(key, new Text("SomeText");

   


RE: Equivalent of cmdline head or tail?

2008-03-06 Thread Jeff Eastman
I think the accepted pattern for this is to accumulate your top N and
bottom N values while you reduce and then output them in the close()
call. The files from your config can be obtained during the configure()
call.

Jeff
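
A minimal sketch of that pattern in the old mapred API (class and type choices are illustrative, and ties between equal values are ignored for brevity): hold the collector in a field, keep the N best entries in a sorted map, and flush them in close().

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TopNReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  private static final int N = 10;
  private final TreeMap<Long, Text> top = new TreeMap<Long, Text>();
  private OutputCollector<Text, LongWritable> out;

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    out = output;                             // remember the collector for close()
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    top.put(sum, new Text(key));              // copy: the framework reuses 'key'
    if (top.size() > N) {
      top.remove(top.firstKey());             // drop the current smallest
    }
  }

  public void close() throws IOException {
    for (Map.Entry<Long, Text> e : top.entrySet()) {
      out.collect(e.getValue(), new LongWritable(e.getKey()));
    }
  }
}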

-Original Message-
From: Jimmy Wan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 06, 2008 10:40 AM
To: core-user@hadoop.apache.org
Subject: Equivalent of cmdline head or tail?

I've got some jobs where I'd like to just pull out the top N or bottom N
values.

It seems like I can't do this from the map or combine phases (due to not
having enough data), but I could aggregate this data during the reduce
phase. The problem I have is that I won't know when to actually write them
out until I've gone through the entire set, at which point reduce isn't
called anymore.

It's easy enough to post-process with some combination of sort, head, and
tail, but I was wondering if I was missing something obvious.

-- 
Jimmy


RE: Decompression Blues

2008-02-27 Thread Jeff Eastman
I unzipped and rezipped all the files using gzip 1.3.3 and uploaded the
files again. I got the same exceptions.

I set the hadoop.native.lib property to false and bounced the cloud,
then ran my job. I still get the same exceptions.

Any more suggestions?

Jeff

-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 26, 2008 3:47 PM
To: core-user@hadoop.apache.org
Subject: Re: Decompression Blues

Jeff,

On Feb 26, 2008, at 12:58 PM, Jeff Eastman wrote:

> I'm processing a number of .gz compressed Apache and other logs using
> Hadoop 0.15.2 and encountering fatal decompression errors such as:
>
>

How did you compress your input files? Could you share details on the  
version of your gzip and other tools?

Try setting "hadoop.native.lib" property to 'false' via  
NativeCodeLoader.setLoadNativeLibraries for you job and see how it  
works...

Arun

>
> 08/02/26 12:09:12 INFO mapred.JobClient: Task Id :
> task_200802171116_0001_m_05_0, Status : FAILED
> java.lang.InternalError
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:111)
>     at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
>     at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:75)
>     at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:50)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:156)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
>
>
>
> I looked in Jira but did not find any issues. Is this pilot error?  
> Some
> of the files work just fine. Is there a workaround besides  
> unzipping all
> the files in the DFS?
>
>
>
> Jeff
>



RE: Decompression Blues

2008-02-26 Thread Jeff Eastman
My ops guys actually zipped them and who knows what version they used. I
will try rezipping with the gzip on my Linux box and see if that fixes
the problem. Then I'll try setting the native.lib property to false.
Will keep you posted.

Thanks to everybody who has responded,
Jeff

-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 26, 2008 3:47 PM
To: core-user@hadoop.apache.org
Subject: Re: Decompression Blues

Jeff,

On Feb 26, 2008, at 12:58 PM, Jeff Eastman wrote:

> I'm processing a number of .gz compressed Apache and other logs using
> Hadoop 0.15.2 and encountering fatal decompression errors such as:
>
>

How did you compress your input files? Could you share details on the  
version of your gzip and other tools?

Try setting "hadoop.native.lib" property to 'false' via  
NativeCodeLoader.setLoadNativeLibraries for you job and see how it  
works...

Arun

>
> 08/02/26 12:09:12 INFO mapred.JobClient: Task Id :
> task_200802171116_0001_m_05_0, Status : FAILED
> java.lang.InternalError
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:111)
>     at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
>     at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:75)
>     at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:50)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:156)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
>
>
>
> I looked in Jira but did not find any issues. Is this pilot error?  
> Some
> of the files work just fine. Is there a workaround besides  
> unzipping all
> the files in the DFS?
>
>
>
> Jeff
>



Decompression Blues

2008-02-26 Thread Jeff Eastman
I'm processing a number of .gz compressed Apache and other logs using
Hadoop 0.15.2 and encountering fatal decompression errors such as:

 

08/02/26 12:09:12 INFO mapred.JobClient: Task Id :
task_200802171116_0001_m_05_0, Status : FAILED

java.lang.InternalError
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:111)
    at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:75)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:156)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

 

I looked in Jira but did not find any issues. Is this pilot error? Some
of the files work just fine. Is there a workaround besides unzipping all
the files in the DFS?

 

Jeff



RE: newbie question... please help.

2008-02-23 Thread Jeff Eastman
If your main question is "can I host my mssql database on the Hadoop
DFS?", then the answer is no. The DFS is designed for large files that
are written once and read many times, and a database engine would want to
update the files in place.

If, OTOH, your question is "can I move (some of) my mssql database into
Hadoop so I can run some map/reduce jobs against it?", then the answer
is yes, keep on reading.

Deploying Hadoop on Win2k is reputedly possible using Cygwin, but more
complicated than just using Linux directly. You are going to have to
learn your way around Linux one way or the other, so you might as well
take the easier path. It will be an adventure well worth your time.

Jeff

-Original Message-
From: Sher Khan [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 23, 2008 9:29 AM
To: core-user@hadoop.apache.org
Subject: Re: newbie question... please help.

thanks for the quick note.
problem is, i am a complete rookie @ linux + related technologies so
will
have to read n reread stuff a lot - which i am ready for - so will study
the
wiki more carefully.

can u just respond in a lil detail that if hadoop is a dfs & it runs on
or
has the ability of map/reduce paradigms, can i host any of my databases
on
to it? or is it that since it does a map/reduce, setting a db on hadoof
dfs
shall prove to be entirely useless?

i know this might sound completely stoopid for all u experts however its
just my bad that i started on this late :(

please help!

On Sat, Feb 23, 2008 at 8:09 PM, 11 Nov. <[EMAIL PROTECTED]> wrote:

> Hadoop is DFS+MapReduce, which two concepts are totally independent.
> I think you can start with hadoop wiki and the tutorial there.
>
> 2008/2/23, Sher Khan <[EMAIL PROTECTED]>:
> >
> > hi brothers,
> > i am completely confused about the hadoop usage / deployment etc.
> >
> > yes, i did read the documentation & other details on apach
foundation
> site
> > yet i am a dumb a** n perhaps hence am still confused all the more.
> >
> > pray allow me to explain my situation in detail.
> >
> > i have different clusters of servers to perform www + application +
> > database
> > + storage functions @ my work.
> >
> > the entire architecture is running on win2k3 enterprise server +
mssql
> 2k5
> > +
> > asp.net 2.0 for the frontend.
> >
> > i was wondering if i cud be in a position to use / deploy hadoop @
my
> > scenario & if i can how?
> >
> > if i am not mistaken, hadoop is a distributed file system [similar
to
> > ntfs,
> > hpfs, fat32, etc. except for its far too advanced & is capable of
> storing
> > the data residing on it in "map/reduce" kind of architecture] & if i
> need
> > to
> > use hadoop @ my location, i shall need to engineer my applications
> ground
> > up
> > on php / java, mysql [or some other open source compliant db] &
linux as
> > my
> > o/s & storage system?
> >
> > can some kind soul please help me understand the concept &/or guide
me
> to
> > a
> > better drafter reference resources please?
> >
> > thanks in advance for the help bro's.
> >
>


RE: Best Practice?

2008-02-11 Thread Jeff Eastman
You're right again. Once the reducer has clustered all its input canopy
centroids it is done and can collect the resulting canopies to output. I
guess I was just wedged in that close() pattern.

Thanks,
Jeff

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 11, 2008 12:40 PM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?


Jeff,

Doesn't the reducer see all of the data points for each cluster (canopy)
in
a single list?

If so, why the need to output during close?

If not, why not?


On 2/11/08 12:24 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> Hi Owen,
> 
> Thanks for the information. I took Ted's advice and refactored my
mapper
> so as to use a combiner and that solved my front-end canopy generation
> problem, but I still have to output the final canopies in the reducer
> during close() since there is no similar combiner mechanism. I was
> worried about this, but now I won't.
> 
> Thanks,
> Jeff
> 
> -Original Message-
> From: Owen O'Malley [mailto:[EMAIL PROTECTED]
> Sent: Monday, February 11, 2008 10:40 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Best Practice?
> 
> 
> On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:
> 
>> I'm trying to wait until close() to output the cluster centroids to
>> the
>> reducer, but the OutputCollector is not available.
> 
> You hit on exactly the right solution. Actually, because of Pipes and
> Streaming, you have a lot more guarantees than you would expect. In
> particular, you can call output.collect when the framework is between
> calls to map or reduce up until the close finishes.
> 
> -- Owen
> 



RE: Best Practice?

2008-02-11 Thread Jeff Eastman
Hi Owen,

Thanks for the information. I took Ted's advice and refactored my mapper
so as to use a combiner and that solved my front-end canopy generation
problem, but I still have to output the final canopies in the reducer
during close() since there is no similar combiner mechanism. I was
worried about this, but now I won't.

Thanks,
Jeff

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 11, 2008 10:40 AM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?


On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:

> I'm trying to wait until close() to output the cluster centroids to  
> the
> reducer, but the OutputCollector is not available.

You hit on exactly the right solution. Actually, because of Pipes and  
Streaming, you have a lot more guarantees than you would expect. In  
particular, you can call output.collect when the framework is between  
calls to map or reduce up until the close finishes.

-- Owen



RE: Best Practice?

2008-02-10 Thread Jeff Eastman
Hmmm indeed, this is certainly food for thought. I'm cross-posting this
to Mahout since it bears upon my recent submission there. Here's what
that does, and also how I think I can incorporate these ideas into it
too.

Each canopy mapper sees only a subset of the points. It goes ahead and
assigns them to canopies based upon the distance measure and thresholds.
Once it is done, in close(), it computes and outputs the canopy
centroids to the reducer using a constant key.

The canopy reducer sees the entire set of centroids, and clusters them
again into the final canopy centroids that are output. This set of
centroids will then be loaded into all clustering mappers, during
configure(), for the final clustering.

Thinking about your suggestion: if the canopy mapper only maintains
canopy centers, and outputs each point keyed by its canopyCenterId
(perhaps multiple times if a point is covered by more than one canopy)
to a combiner, and if the combiner then sums all of its points to
compute the centroid for output to the canopy reducer, then I won't have
to output anything during close(). Outputting during close() seems to
work, but it doesn't feel right; using a combiner in this manner would
avoid it.

Did I get it?
Jeff
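
To make the partial-sum idea concrete, here is a one-dimensional sketch (purely illustrative; the real Mahout code works on vectors, and the Text encoding here is just for brevity): the mapper emits (canopyId, "1,x") per point, the combiner folds these into partial "count,sum" pairs, and the reducer repeats the same fold and divides sum by count to get the centroid.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CentroidCombiner extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable canopyId, Iterator<Text> values,
                     OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    long count = 0;
    double sum = 0.0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      count += Long.parseLong(parts[0]);
      sum += Double.parseDouble(parts[1]);
    }
    // Emit a partial sum keyed by the same canopy id; because the fold is
    // associative and commutative, this class works as a combiner, and the
    // reducer only needs one extra divide to produce the centroid.
    output.collect(canopyId, new Text(count + "," + sum));
  }
}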



-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 09, 2008 7:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?



Hmmm

I think that computing centroids in the mapper may not be the best idea.

A different structure that would work well is to use the mapper to
assign
data records to centroids and use the centroid number as the key for the
reduce key.  Then the reduce itself can compute the centroids.  You can
read
the old centroids from HDFS in the configure method of the mapper.
Lather,
rinse, repeat.

This process avoids moving large amounts of data through the
configuration
process.

This method can be extended to more advanced approaches such as Gaussian
mixtures by emitting each input record multiple times with multiple
centroid
keys and a strength of association.

Computing centroids in the mapper works well in that it minimizes the
amount
of data that is passed to the reducers, but it critically depends on the
availability of sufficient statistic for computing cluster centroids.
This
works fine for Gaussian processes (aka k-means), but there are other
mixture
models that require fancier updates than this.

Computing centroids in the reducer allows you avoid your problem with
the
output collector.  If sufficient statistics like sums (means) are
available
then you can use a combiner to do the reduction incrementally and avoid
moving too much data around.  The reducer will still have to accumulate
these partial updates for final output, but it won't have to compute
very
much of them.

All of this is completely analogous to word-counting, actually. You
don't accumulate counts in the mapper; you accumulate partial sums in the
combiner and final sums in the reducer.
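
To make that concrete, here is a rough sketch of the assignment mapper
(again just a sketch: old mapred API with 0.19-style signatures, a made-up
"kmeans.centroids.path" JobConf key pointing at a plain-text centroid file,
and 1-D points so the distance code stays tiny):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KMeansAssignMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private List<Double> centroids = new ArrayList<Double>();

  // Load the previous iteration's centroids from HDFS once per task.
  // "kmeans.centroids.path" is a made-up key the driver would set.
  public void configure(JobConf job) {
    try {
      Path path = new Path(job.get("kmeans.centroids.path"));
      FileSystem fs = FileSystem.get(job);
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(path)));
      String line;
      while ((line = in.readLine()) != null) {
        if (line.trim().length() > 0) {
          centroids.add(Double.valueOf(line.trim()));
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load centroids", e);
    }
  }

  // Key each point by the index of its nearest centroid; a combiner can
  // merge partial sums ("1<TAB>x" values) and the reducer computes the
  // new centroids.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double x = Double.parseDouble(value.toString().trim());
    int best = 0;
    for (int i = 1; i < centroids.size(); i++) {
      if (Math.abs(x - centroids.get(i)) < Math.abs(x - centroids.get(best))) {
        best = i;
      }
    }
    output.collect(new Text(Integer.toString(best)), new Text("1\t" + x));
  }
}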




On 2/9/08 4:21 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
> 
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
> 
> Jeff
> 



RE: Best Practice?

2008-02-09 Thread Jeff Eastman
Well, I tried saving the OutputCollectors in an instance variable and
writing to them during close and it seems to work. 

Jeff

-Original Message-
From: Jeff Eastman [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 09, 2008 4:21 PM
To: core-user@hadoop.apache.org
Subject: RE: Best Practice?

Thanks Aaron, I missed that one. Now I have my configuration information
in my mapper. In the mapper, I'm computing cluster centroids by reading
all the input points and assigning them to clusters. I don't actually
store the points in the mapper, just the evolving centroids. 

I'm trying to wait until close() to output the cluster centroids to the
reducer, but the OutputCollector is not available. Is there a way to do
this, or do I need to backtrack?

Jeff




RE: Best Practice?

2008-02-09 Thread Jeff Eastman
Thanks Aaron, I missed that one. Now I have my configuration information
in my mapper. In the mapper, I'm computing cluster centroids by reading
all the input points and assigning them to clusters. I don't actually
store the points in the mapper, just the evolving centroids. 

I'm trying to wait until close() to output the cluster centroids to the
reducer, but the OutputCollector is not available. Is there a way to do
this, or do I need to backtrack?

Jeff




Best Practice?

2008-02-09 Thread Jeff Eastman
What's the best way to get additional configuration arguments to my
mappers and reducers?

 

Jeff
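
(The answer Jeff acknowledges in the follow-ups above is presumably the
usual JobConf route: set values when building the job and read them back in
configure(). A minimal sketch, with "my.cluster.k" as a made-up property
name:)

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class ConfiguredBase extends MapReduceBase {
  protected int k;

  // Driver side: conf.setInt("my.cluster.k", 20); before submitting the job.
  // Task side: the framework calls configure(JobConf) once per task, before
  // any map()/reduce() calls, so pick the value up here.
  public void configure(JobConf job) {
    k = job.getInt("my.cluster.k", 10);
  }
}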



RE: Starting up a larger cluster

2008-02-08 Thread Jeff Eastman
I noticed that phenomenon right off the bat. Is that a designed "feature"
or just an unhappy consequence of how blocks are allocated? Ted
compensates for it by aggressively rebalancing his cluster, adjusting the
replication up and down, but I wonder whether a better allocation
strategy would help.

I've also used Ted's trick, with less than marvelous results. I'd hate
to pull my biggest machine (where I store all the backup files) out of
the cluster just to get more even block distribution but I may have to.

Jeff

-Original Message-
From: Allen Wittenauer [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 08, 2008 9:15 AM
To: core-user@hadoop.apache.org
Subject: Re: Starting up a larger cluster

On 2/7/08 11:01 PM, "Tim Wintle" <[EMAIL PROTECTED]> wrote:

>  it's
> useful to be able to connect from nodes that aren't in the slaves file
> so that you can put in input data direct from another machine that's not
> part of the cluster,

I'd actually recommend this as a best practice. We've been bitten
over... and over... and over... with users loading data into HDFS from a
data node, only to discover that the block distribution is pretty horrid,
which in turn means that MR performance is equally horrid. [Remember: all
writes will go to the local node if it is a data node!]

We're now down to the point that we've got one (relatively smaller) grid
that is used for data loading/creation/extraction, which then distcp's
its contents to another grid.

Less than ideal, but it definitely helps the performance of the entire
'real' grid.




RE: Starting up a larger cluster

2008-02-07 Thread Jeff Eastman
Oops, should be TaskTracker.

-Original Message-
From: Jeff Eastman [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 12:24 PM
To: core-user@hadoop.apache.org
Subject: RE: Starting up a larger cluster

Hi Ben,

I've been down this same path recently and I think I understand your
issues:

1) Yes, you need the hadoop folder to be in the same location on each
node. Only the master node actually uses the slaves file, to start up
DataNode and JobTracker daemons on those nodes.
2) If you did not specify any slave nodes on your master node, then
start-all did not create these processes on any nodes other than the
master. That node can still be accessed, and the DFS written to, from
other machines as you have done, but there is no replication since there
is only one DataNode.

Try running jps on your other nodes to verify this, and access the
NameNode web page to see what slaves you actually have running. By
adding your slave nodes to the slaves file on your master and bouncing
hadoop you should see a big difference in the size of your cluster.

Good luck, it's an adventure,
Jeff

-Original Message-
From: Ben Kucinich [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 10:52 AM
To: core-user@hadoop.apache.org
Subject: Starting up a larger cluster

In the Nutch wiki, I was reading this
http://wiki.apache.org/hadoop/GettingStartedWithHadoop

I have problems understanding this section:

== Starting up a larger cluster ==

 Ensure that the Hadoop package is accessible from the same path on
all nodes that are to be included in the cluster. If you have
separated configuration from the install then ensure that the config
directory is also accessible the same way.
 Populate the slaves file with the nodes to be included in the
cluster. One node per line.

1) Does the first line mean that I have to place the hadoop folder in
exactly the same location on every slave node? For example, if I put the
hadoop home directory in /usr/local/ on the master node, should it be
present in /usr/local/ on all the slave nodes as well?

2) I ran start-all.sh on one node (192.168.1.2) with fs.default.name
set to 192.168.1.2:9000 and mapred.job.tracker set to 192.168.1.2:9001,
so I believe this node plays the role of master node. I did not populate
the slaves file with any slave nodes. But on many other systems
(192.168.1.3, 192.168.1.4, etc.) I made the same settings in
hadoop-site.xml, so I believe these are slave nodes. Now on the slave
nodes I ran commands like bin/hadoop dfs -put dir newdir and the newdir
was created in the DFS. I wonder how the master node allowed the slave
nodes to put the files even though I did not populate the slaves file.

Please help me with these queries since I am new to Hadoop.



RE: Starting up a larger cluster

2008-02-07 Thread Jeff Eastman
Hi Ben,

I've been down this same path recently and I think I understand your
issues:

1) Yes, you need the hadoop folder to be in the same location on each
node. Only the master node actually uses the slaves file, to start up
DataNode and JobTracker daemons on those nodes.
2) If you did not specify any slave nodes on your master node, then
start-all did not create these processes on any nodes other than the
master. That node can still be accessed, and the DFS written to, from
other machines as you have done, but there is no replication since there
is only one DataNode.

Try running jps on your other nodes to verify this, and access the
NameNode web page to see what slaves you actually have running. By
adding your slave nodes to the slaves file on your master and bouncing
hadoop you should see a big difference in the size of your cluster.

Good luck, it's an adventure,
Jeff

-Original Message-
From: Ben Kucinich [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 10:52 AM
To: core-user@hadoop.apache.org
Subject: Starting up a larger cluster

In the Nutch wiki, I was reading this
http://wiki.apache.org/hadoop/GettingStartedWithHadoop

I have problems understanding this section:

== Starting up a larger cluster ==

 Ensure that the Hadoop package is accessible from the same path on
all nodes that are to be included in the cluster. If you have
separated configuration from the install then ensure that the config
directory is also accessible the same way.
 Populate the slaves file with the nodes to be included in the
cluster. One node per line.

1) Does the first line mean that I have to place the hadoop folder in
exactly the same location on every slave node? For example, if I put the
hadoop home directory in /usr/local/ on the master node, should it be
present in /usr/local/ on all the slave nodes as well?

2) I ran start-all.sh on one node (192.168.1.2) with fs.default.name
set to 192.168.1.2:9000 and mapred.job.tracker set to 192.168.1.2:9001,
so I believe this node plays the role of master node. I did not populate
the slaves file with any slave nodes. But on many other systems
(192.168.1.3, 192.168.1.4, etc.) I made the same settings in
hadoop-site.xml, so I believe these are slave nodes. Now on the slave
nodes I ran commands like bin/hadoop dfs -put dir newdir and the newdir
was created in the DFS. I wonder how the master node allowed the slave
nodes to put the files even though I did not populate the slaves file.

Please help me with these queries since I am new to Hadoop.



RE: Platform reliability with Hadoop

2008-01-22 Thread Jeff Eastman
It's alive! Just in case any others follow this thread, I ended up
overriding only the following hadoop-default.xml entries:

Physical location of DFS on local disks:
- dfs.name.dir = /u1/cloud-data 
- dfs.data.dir = /u1/cloud-data

DFS location of mapred files:
- mapred.system.dir = /hadoop/mapred/system
- mapred.temp.dir = /hadoop/mapred/temp

Each user gets their own /users/username directory in the DFS and jobs
submitted by each user use their own user directories. Now to find a
bigger problem to solve...

Jeff


-Original Message-
From: Jeff Eastman [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 21, 2008 11:15 AM
To: [EMAIL PROTECTED]
Subject: RE: Platform reliability with Hadoop

Is it really that simple? 

The Wiki page GettingStartedWithHadoop recommends setting dfs.name.dir,
dfs.data.dir, dfs.client.buffer.dir and mapred.local.dir to
"appropriate" values (without giving an example). Should these be fixed
(XX) or variable (XX-${user.name}) values? The FAQ page recommends
setting the mapred.system.dir to a fixed value (e.g.
/hadoop/mapred/system), so I chose fixed values too.

- dfs.name.dir - /u1/cloud-data
- dfs.data.dir - /u1/cloud-data
- mapred.system.dir - /u1/cloud-data
- mapred.local.dir - /u1/cloud-data

I did not override dfs.client.buffer.dir (determines where on the local
filesystem a DFS client should store its blocks before it sends them to
the datanode) because my 'jeastman' client could not put data into the
dfs with it set to the fixed value.

There are 4 other settings that use ${hadoop.tmp.dir}, and these seem
appropriately tmp-ish. I did not redefine them:

- fs.trash.root - The trash directory, used by FsShell's 'rm' command.
- fs.checkpoint.dir - Determines where on the local filesystem the DFS
secondary name node should store the temporary images and edits to
merge.
- fs.s3.buffer.dir - Determines where on the local filesystem the S3
filesystem should store its blocks before it sends them to S3 or after
it retrieves them from S3.
- mapred.temp.dir - A shared directory for temporary files.
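
(An aside on the fixed-vs-${user.name} question above: Configuration
expands ${...} references against its own properties and Java system
properties when a value is read, so a quick way to see what a
variable-style entry resolves to is a sketch like the one below; the
property value shown is just an example.)

import org.apache.hadoop.conf.Configuration;

public class ExpandCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A variable-style value: ${user.name} resolves from the Java system
    // property of the user running the code when the value is read.
    conf.set("dfs.data.dir", "/u1/cloud-data/${user.name}");
    System.out.println(conf.get("dfs.data.dir"));
    // e.g. prints /u1/cloud-data/jeastman when run as 'jeastman'
  }
}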

Jeff

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Sunday, January 20, 2008 11:44 AM
To: [EMAIL PROTECTED]
Subject: Re: Platform reliability with Hadoop

You might want to change the hadoop.tmp.dir entry alone; since the
others are derived from it, everything should be fine. I am wondering,
though, whether hadoop.tmp.dir might be used elsewhere.
Thanks,
lohit

- Original Message 
From: Jeff Eastman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop


I am almost operational again, but something in my configuration is
still not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot
locate any input data, so it quits immediately with this in the Map
Completion Graph display:
XML Parsing Error: no element found
Location:
http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
Line Number 1, Column 1:

I've attached my site.xml file.
Jeff
-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 16, 2008 10:04 AM
To: [EMAIL PROTECTED]
Subject: Re: Platform reliability with Hadoop

The /tmp default has caught us once or twice too. Now we put the files 
elsewhere.

[EMAIL PROTECTED] wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and reprofile
>> them
>>
> Won't you lose your blocks after reboot since /tmp gets cleaned up?
> Could this be the reason you see data corruption?
> A good idea is to configure DFS to be any place other than /tmp
>
> Thanks,
> Lohit
> - Original Message 
> From: Jeff Eastman <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, January 16, 2008 9:32:41 AM
> Subject: Platform reliability with Hadoop
>
>
> I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen
> machines in our CUBiT array for the last month. During this time I have
> experienced two major data corruption losses on relatively small amounts
> of data (<50gb) that make m