Re: Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Paco NATHAN
hi peter,
thinking aloud on this -

trade-offs may depend on:

   * how much grouping would be possible (tracking a PDF would be
interesting for metrics)
   * locality of key/value pairs (distributed among mapper and reducer tasks)

to that point, will there be much time spent in the shuffle?  if so,
it's probably cheaper to shuffle/sort the grouped row vectors than the
many small key,value pairs

in any case, when i had a similar situation on a large data set (2-3
TB shuffle) a good pattern to follow was:

   * mapper emitted small key,value pairs
   * combiner grouped into row vectors

that combiner may get invoked both at the end of the map phase and at
the beginning of the reduce phase (more benefit)

also, using byte arrays where possible to represent values can save a
lot of shuffle time
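
for what it's worth, here's a rough python sketch of that grouping step
-- it assumes tab-separated "i \t (j,v)" lines already sorted by key, so
run it reduce-side, or do the same buffering with a dict inside the
mapper. the field layout is made up; adjust to whatever your mapper emits:

    #!/usr/bin/env python
    # rough sketch: collapse sorted "i \t (j,v)" pairs into one
    # posting-list style row vector per key.
    import sys

    def flush(key, elems):
        # emit one grouped row: i \t (j1,v1),(j2,v2),...
        if key is not None and elems:
            sys.stdout.write('%s\t%s\n' % (key, ','.join(elems)))

    cur_key, cur_elems = None, []

    for line in sys.stdin:
        key, elem = line.rstrip('\n').split('\t', 1)
        if key != cur_key:
            flush(cur_key, cur_elems)
            cur_key, cur_elems = key, []
        cur_elems.append(elem)

    flush(cur_key, cur_elems)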

best,
paco


On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
peter.skomor...@gmail.com wrote:
 Hadoop streaming question: If I am forming a matrix M by summing a number of
 elements generated on different mappers, is it better to emit tons of lines
 from the mappers with small key,value pairs for each element, or should I
 group them into row vectors before sending to the reducers?

 For example, say I'm summing frequency count matrices M for each user on a
 different map task, and the reducer combines the resulting sparse user count
 matrices for use in another calculation.

 Should I emit the individual elements:

 i (j, Mij) \n
 3 (1, 3.4) \n
 3 (2, 3.4) \n
 3 (3, 3.4) \n
 4 (1, 2.3) \n
 4 (2, 5.2) \n

 Or posting list style vectors?

 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
 4 ((1, 2.3), (2, 5.2)) \n

 Using vectors will at least save some message space, but are there any other
 benefits to this approach in terms of Hadoop streaming overhead (sorts
 etc.)?  I think buffering issues will not be a huge concern since the lengths
 of the vectors have a reasonable upper bound and they will be in a sparse
 format...


 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch



Re: EC2 Usage?

2008-12-18 Thread Paco NATHAN
Ryan,

A developer on our team wrote some JSP to add to the Job Tracker, so
that job times and other stats could be accessed programmatically via
web services:

   https://issues.apache.org/jira/browse/HADOOP-4559

There's another update coming for that patch in JIRA, to get task data.

Paco



On Thu, Dec 18, 2008 at 11:17, Tom White t...@cloudera.com wrote:
 Hi Ryan,

 The ec2-describe-instances command in the API tool reports the launch
 time for each instance, so you could work out the machine hours of
 your cluster using that information.

 Tom
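
(As a rough illustration of Tom's suggestion, here's a boto sketch that
adds up machine-hours for running instances. It assumes boto's classic
EC2 API and AWS credentials in the environment; the launch_time parsing
may need adjusting.)

    #!/usr/bin/env python
    # rough sketch: approximate machine-hours so far for running instances,
    # based on the launch times that ec2-describe-instances / boto report.
    from datetime import datetime
    import boto

    conn = boto.connect_ec2()          # creds from the environment
    now = datetime.utcnow()
    hours = 0.0

    for reservation in conn.get_all_instances():
        for inst in reservation.instances:
            if inst.state != 'running':
                continue
            # launch_time looks like 2008-12-18T16:12:30.000Z
            launched = datetime.strptime(inst.launch_time[:19],
                                         '%Y-%m-%dT%H:%M:%S')
            delta = now - launched
            hours += (delta.days * 86400 + delta.seconds) / 3600.0

    print('approx machine-hours so far: %.1f' % hours)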

 On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote:
 Hello all,

 Somewhat of an off-topic question, but I know there are
 Hadoop + EC2 users here. Does anyone know if there is a programmatic
 API to find out how many machine-hours have been used by a
 Hadoop cluster (or anything) running on EC2? I know that you can log
 into the EC2 web site and see this, but I'm wondering if there's a way
 to access this data programmatically via web services?

 Thanks,
 Ryan




Re: Getting Reduce Output Bytes

2008-11-25 Thread Paco NATHAN
Hi Lohit,

Our team collects those kinds of measurements using this patch:
   https://issues.apache.org/jira/browse/HADOOP-4559

Some example Java code in the comments shows how to access the data,
which is serialized as JSON.  Looks like the red_hdfs_bytes_written
value would give you that.
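
For what it's worth, reading that JSON from application code is only a
few lines. A hypothetical sketch -- the servlet path and job id below are
made up, so check the example Java code in the JIRA comments for the real
endpoint:

    #!/usr/bin/env python
    # hypothetical sketch: fetch the per-job stats JSON exposed by the
    # HADOOP-4559 patch and pull out one counter. the URL is made up.
    import urllib2
    try:
        import json                  # python 2.6+
    except ImportError:
        import simplejson as json    # older pythons

    url = ('http://jobtracker.example.com:50030/'
           'jobstats.jsp?jobid=job_200811250000_0001')
    stats = json.loads(urllib2.urlopen(url).read())

    # per the thread, reduce-side HDFS output bytes show up under this key
    print('reduce output bytes: %s' % stats.get('red_hdfs_bytes_written'))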

Best,
Paco


On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote:
 Hello,

 Is there an easy way to get Reduce Output Bytes?

 Thanks,
 Lohit




Re: combiner stats

2008-11-18 Thread Paco NATHAN
Thank you, Devaraj -
That explanation helps a lot.

Is the following reasonable to say?

The Combine input records count shown in the Map phase column of the
report is a measure of how many times records have passed through the
Combiner during merges of intermediate spills. Therefore, it may be
larger than the actual count of records being merged.


Paco


 On the map side, the combiner is called after sort and during the merges of
 the intermediate spills. At the end a single spill file is generated. Note
 that, during the merges, the same record may pass multiple times through the
 combiner.

On Mon, Nov 17, 2008 at 23:04, Devaraj Das [EMAIL PROTECTED] wrote:



 On 11/18/08 3:59 AM, Paco NATHAN [EMAIL PROTECTED] wrote:

 Could someone please help explain the job counters shown for Combine
 records on the JobTracker JSP page?

 Here's an example from one of our MR jobs.  There are Combine input
 and output record counters shown for both Map phase and Reduce phase.
 We're not quite sure how to interpret them -

 Map Phase:
Map input records   85,013,261,279
Map output records   85,013,261,279
Combine input records   114,936,724,505
Combine output records   38,750,511,975

 Reduce Phase:
Combine input records   8,827,017,275
Combine output records   17,986,654
Reduce input groups   2,221,796
Reduce input records   17,986,654
Reduce output records   4,443,590


 What makes sense:
* Considering the MR job and its data, the 85.0b count for Map
 output records is expected
* I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
* Reduce phase shows Combine output records at 18.0m = Reduce input
 records at 18.0m
* Reduce input groups at 2.2m is expected
* Reduce output records at 4.4m is verified

 What doesn't make sense:
* The 115b count for Combine input records during Map phase
* The 8.8b count for Combine input records during Reduce phase


 On the map side, the combiner is called after sort and during the merges of
 the intermediate spills. At the end a single spill file is generated. Note
 that, during the merges, the same record may pass multiple times through the
 combiner.
 On the reducer side, the combiner would be called only during merges of
 intermediate data, and the intermediate merges stop at a certain point (when
 we have <= io.sort.factor files remaining). Hence the combiner may be called
 fewer times here...

 What would be the actual count of records coming out of the Map phase?

 Thanks,
 Paco





combiner stats

2008-11-17 Thread Paco NATHAN
Could someone please help explain the job counters shown for Combine
records on the JobTracker JSP page?

Here's an example from one of our MR jobs.  There are Combine input
and output record counters shown for both Map phase and Reduce phase.
We're not quite sure how to interpret them -

Map Phase:
   Map input records   85,013,261,279
   Map output records   85,013,261,279
   Combine input records   114,936,724,505
   Combine output records   38,750,511,975

Reduce Phase:
   Combine input records   8,827,017,275
   Combine output records   17,986,654
   Reduce input groups   2,221,796
   Reduce input records   17,986,654
   Reduce output records   4,443,590


What makes sense:
   * Considering the MR job and its data, the 85.0b count for Map
output records is expected
   * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
   * Reduce phase shows Combine output records at 18.0m = Reduce input
records at 18.0m
   * Reduce input groups at 2.2m is expected
   * Reduce output records at 4.4m is verified

What doesn't make sense:
   * The 115b count for Combine input records during Map phase
   * The 8.8b count for Combine input records during Reduce phase

What would be the actual count of records coming out of the Map phase?

Thanks,
Paco


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.

One way to detect failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.

In a nutshell, that's one approach for managing a Hadoop cluster
remotely on EC2.
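
In case it helps, the SSH-in-a-loop check is only a few lines of Python.
A rough sketch -- hostnames, retry counts, and ssh options are made up:

    #!/usr/bin/env python
    # rough sketch: decide which launched nodes never became reachable
    # over SSH. hostnames, retries, and ssh options are made up.
    import subprocess, time

    def ssh_ok(host, retries=10, wait=30):
        for _ in range(retries):
            rc = subprocess.call(
                ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10',
                 'root@%s' % host, 'true'])
            if rc == 0:
                return True
            time.sleep(wait)
        return False

    hosts = ['ec2-1-2-3-4.compute-1.amazonaws.com']   # from the launch script
    dead = [h for h in hosts if not ssh_ok(h)]
    if dead:
        print('failed at launch: %s' % ', '.join(dead))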

Best,
Paco


On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:

 This workflow could be initiated from a crontab -- totally automated.
 However, we still see occasional failures of the cluster, and must
 restart manually, but not often.  Stability for that has improved much
 since the 0.18 release.  For us, it's getting closer to total
 automation.

 FWIW, that's running on EC2 m1.xl instances.

 Same here.  I've always had the namenode and web interface be accessible,
 but sometimes I don't get the slave nodes - usually zero slaves when this
 happens, sometimes I only miss one or two.  My rough estimate is that this
 happens 1% of the time.

 I currently have to notice this and restart manually.  Do you have a good
 way to detect it?  I have several Hadoop clusters running at once with the
 same AWS image and SSH keypair, so I can't count running instances.  I could
 have a separate keypair per cluster and count instances with that keypair,
 but I'd like to be able to start clusters opportunistically, with more than
 one cluster doing the same kind of job on different data.


 Karl Anderson
 [EMAIL PROTECTED]
 http://monkey.org/~kra






Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Paco NATHAN
Hi Stuart,

Yes, we do that.  Ditto on most of what Chris described.

We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from
S3 when it launches. That controls the versions for tools/frameworks,
instead of redoing an AMI each time a tool has an update.

A remote server -- in our data center -- acts as a controller, to
launch and manage the cluster.  FWIW, an engineer here wrote those
scripts in Python using boto.  We had to patch boto, and the patch was
submitted upstream.

The mapper for the first MR job in the workflow streams in data from
S3.  Reducers in subsequent jobs have the option to write output to S3
(as Chris mentioned)

After the last MR job in the workflow completes, it pushes a message
into SQS.  The remote server polls SQS, then performs a shutdown of
the cluster.
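
The polling loop on the controller side is short. A rough boto sketch --
the queue name and the shutdown hand-off are made up:

    #!/usr/bin/env python
    # rough sketch of the controller-side poll: block until the workflow's
    # "done" message shows up in SQS, then hand off to cluster teardown.
    # queue name and teardown step are made up.
    import time
    import boto

    conn = boto.connect_sqs()          # creds from the environment
    queue = conn.get_queue('hadoop-workflow-done')

    while True:
        msg = queue.read()
        if msg is not None:
            print('workflow finished: %s' % msg.get_body())
            queue.delete_message(msg)
            break
        time.sleep(60)

    # then call whatever tears the cluster down -- e.g. the same scripts
    # that launched it, run with a terminate option.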

We may replace SQS with RabbitMQ -- it's more flexible for brokering
other kinds of messages between Hadoop on AWS and the
controller/consumer of results back in our data center.


This workflow could be initiated from a crontab -- totally automated.
However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved much
since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.

Paco



On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra [EMAIL PROTECTED] wrote:
 Hi folks,
 Anybody tried scripting Hadoop on EC2 to...
 1. Launch a cluster
 2. Pull data from S3
 3. Run a job
 4. Copy results to S3
 5. Terminate the cluster
 ... without any user interaction?

 -Stuart



Re: Can jobs be configured to be sequential

2008-10-17 Thread Paco NATHAN
Hi Ravion,

The problem you are describing sounds like a workflow where you must
be careful to verify certain conditions before proceeding to a next
step.

We have similar kinds of use cases for Hadoop apps at work, which are
essentially ETL.  I recommend that you look at http://cascading.org as
an abstraction layer for managing these kinds of workflows. We've
found it quite useful.
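
Cascading takes care of this kind of dependency checking for you. If you
do end up wiring it by hand instead, the core of it is just running each
group in order and stopping on the first failure. A bare-bones sketch --
the jar and class names are made up:

    #!/usr/bin/env python
    # bare-bones sketch: run group G1, stop if anything fails, else run G2.
    # jar/class names are made up; each entry is just a hadoop command line.
    import subprocess, sys

    groups = [
        # G1
        [['hadoop', 'jar', 'etl.jar', 'com.example.JobA'],
         ['hadoop', 'jar', 'etl.jar', 'com.example.JobB']],
        # G2 -- only runs if all of G1 succeeded
        [['hadoop', 'jar', 'etl.jar', 'com.example.JobC']],
    ]

    for i, group in enumerate(groups):
        for cmd in group:
            if subprocess.call(cmd) != 0:
                sys.exit('group G%d failed on: %s' % (i + 1, ' '.join(cmd)))
        print('group G%d completed' % (i + 1))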

Best,
Paco


On Fri, Oct 17, 2008 at 8:29 PM, Ravion [EMAIL PROTECTED] wrote:
 Dear all,

 We have in our Data Warehouse System about 600 ETL (Extract Transform Load)
 jobs to create an interim data model. Some jobs are dependent on completion of
 others.

 Assume that I create a group id for interdependent jobs. Say a group G1 contains
 100 jobs, G2 contains another 200 jobs which are dependent on completion of
 Group G1, and so on.

 Can we leverage Hadoop so that Hadoop executes G1 first; on failure it
 won't execute G2, otherwise it will continue with G2 and so on?

 Or do I need to configure N (where N = total number of groups) Hadoop
 jobs independently and handle the dependencies ourselves?

 Please share your thoughts, thanks

 Warmest regards,
 Ravion


Re: Questions about Hadoop

2008-09-26 Thread Paco NATHAN
Edward,

Can you describe more about Hama, with respect to Hadoop?
I've read through the Incubator proposal and your blog -- it's a great approach.

Are there any benchmarks available?  E.g., size of data sets used,
kinds of operations performed, etc.

Will this project be able to make use of existing libraries?

Best,
Paco


On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon [EMAIL PROTECTED] wrote:
 The decision making system seems interesting to me. :)

 The question I want to ask is whether it is possible to perform statistical 
 analysis on the data using Hadoop and MapReduce.

 I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
 framework for matrix algebra and its use in statistical analysis on Hadoop
 and HBase. (It is still in its early stage.)

 /Edward


Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Arijit,

For workflow, check out http://cascading.org  -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of PIG, Hive, these RDBMS vendors, etc., they seem to be
quite good for situations with lots of ad hoc queries, short-term
business intelligence needs, or less-technical staff involved.
However, if there are large, repeated batch jobs which require
significant analytics work, then I'm not so convinced that SQL is the
right mind-set for representing the math required for algorithms or
for maintaining complex code throughout the software lifecycle.


I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes.  One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.

In terms of running R on large data, one issue is that -- in contrast
to SAS, where data is handled line-by-line -- R is limited by how much
data can be loaded into memory.

Another issue is that while some areas of statistical data analysis
are suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of
displacing R, SAS, etc.  For example, you can accomplish much by
scanning a data set to determine N, sum X, sum X^2, etc., to produce
descriptive stats, quantiles, C.I., plots for p.d.f., c.d.f., etc.
Quite useful. However, MapReduce requires data independence, so it
will not serve well for tasks such as inverting a matrix.
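
As a concrete (if toy) example of that kind of scan, the reduce side of a
streaming job for descriptive stats is only a page of Python -- this
sketch assumes the mapper has already emitted "key \t x" lines:

    #!/usr/bin/env python
    # toy sketch of a streaming reducer for descriptive stats: for each key,
    # accumulate N, sum X, sum X^2 and emit count, mean, and variance.
    # assumes "key \t x" input, sorted by key by the framework.
    import sys

    def flush(key, n, s, ss):
        if key is None or n == 0:
            return
        mean = s / n
        var = ss / n - mean * mean       # population variance
        sys.stdout.write('%s\t%d\t%f\t%f\n' % (key, n, mean, var))

    cur, n, s, ss = None, 0, 0.0, 0.0
    for line in sys.stdin:
        key, val = line.rstrip('\n').split('\t', 1)
        if key != cur:
            flush(cur, n, s, ss)
            cur, n, s, ss = key, 0, 0.0, 0.0
        x = float(val)
        n, s, ss = n + 1, s + x, ss + x * x
    flush(cur, n, s, ss)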

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R.  It
is at an early stage, and there's no plan yet about a public release.
It's not a simple thing to implement by any stretch of the
imagination!

Best,
Paco



On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce
 jobs - the first one will extract a certain amount of data from the
 original set and do some computation resulting in a smaller summary,
 which will then be the input to a further MR job, and so on...somewhat
 similar to a workflow as in the SOA world.

 Is it possible to use statistical analysis tools such as R (or say PL/R)
 within MapReduce on Hadoop? As far as I've heard, Greenplum is working
 on a custom MapReduce engine over their Greenplum database which will
 also support PL/R procedures.

 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107
 http://www.connectivasystems.com



Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Certainly. It'd be great to talk with others working in analytics and
statistical computing, who have been evaluating MapReduce as well.

Paco


On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 That's a very good overview Paco - thanx for that. I might get back to
 you with more queries about cascade etc. at some time - hope you
 wouldn't mind.

 Regards
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107
 http://www.connectivasystems.com


 -Original Message-
 From: Paco NATHAN [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 24, 2008 6:10 PM
 To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
 Subject: Re: Questions about Hadoop


 Arijit,

 For workflow, check out http://cascading.org  -- that works quite well
 and fits what you described.

 Greenplum and Aster Data have announced support for running MR within
 the context of their relational databases, e.g.,
 http://www.greenplum.com/resources/mapreduce/

 In terms of PIG, Hive, these RDBMS vendors, etc., they seem to be quite
 good for situations where there are lots of ad hoc queries, business
 intelligence needs short-term, less-technical staff involved. However,
 if there are large, repeated batch jobs which require significant
 analytics work, then I'm not so convinced that SQL is the right mind-set
 for representing the math required for algorithms or for maintaining
 complex code throughout the software lifecycle.


 I run an analytics group where our statisticians use R, while our
 developers use Hadoop, Cascading, etc., at scale on terabytes.  One
 approach is simply to sample data, analyze it in R, then use the
 analysis to articulate requirements for developers to use at scale.

 In terms of running R on large data, one issue is that -- in contrast to
 SAS, where data is handled line-by-line -- R is limited by how much data
 can be loaded into memory.

 Another issue is that while some areas of statistical data analysis are
 suitable for MapReduce, others clearly are not. Mahout or similar
 projects may go far, but do not expect them to be capable of displacing
 R, SAS, etc.  For example, you can accomplish much by scanning a data
 set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
 quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
 MapReduce requires data independence, so it will not serve well for
 tasks such as inverting a matrix.

 You might want to look into Parallel R, and talk with
 http://www.revolution-computing.com/

 Our team has a project which runs Hadoop workflows underneath R.  It is
 at an early stage, and there's no plan yet about a public release. It's
 not a simple thing to implement by any stretch of the imagination!

 Best,
 Paco



 On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
 [EMAIL PROTECTED] wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce
 jobs - the first one will extract a certain amount of data from the
 original set and do some computation resulting in a smaller summary,
 which will then be the input to a further MR job, and so on...somewhat

 similar to a workflow as in the SOA world.

 Is it possible to use statistical analysis tools such as R (or say
 PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
 working on a custom MapReduce engine over their Greenplum database
 which will also support PL/R procedures.

 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com






Re: How to manage a large cluster?

2008-09-16 Thread Paco NATHAN
Thanks, Steve -

Another flexible approach to handling messages across firewalls,
between the jobtracker and worker nodes, etc., would be to place an AMQP
message broker on the jobtracker and another inside our local network.
We're experimenting with RabbitMQ for that.


On Tue, Sep 16, 2008 at 4:03 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 We use a set of Python scripts to manage a daily, (mostly) automated
 launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
 on a local server, so that the Hadoop job can send notification when
 it completes, and allow the local server to initiate download of
 results.  Overall, that minimizes the need for having a sysadmin
 dedicated to the Hadoop jobs -- a small dev team can handle it, while
 focusing on algorithm development and testing.

 1. We have some components that use google talk to relay messages to local
 boxes behind the firewall. I could imagine hooking up hadoop status events
 to that too.

 2. There's an old paper of mine, Making Web Services that Work, in which I
 talk about deployment centric development:
 http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

 The idea is that right from the outset, the dev team work on a cluster that
 resembles production, the CI server builds to it automatically, changes get
 pushed out to production semi-automatically (you tag the version you want
 pushed out in SVN, the CI server does the release). The article is focused
 on services exported to third parties, not back end stuff, so it may not all
 apply to hadoop deployments.

 -steve


Re: How to manage a large cluster?

2008-09-15 Thread Paco NATHAN
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.
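
To make that concrete, the pull-at-boot step is only a few lines with
boto. A rough sketch -- bucket and key names are made up:

    #!/usr/bin/env python
    # rough sketch of the pull-at-boot step: fetch pinned tarballs from S3
    # so the AMI itself never has to change. bucket/key names are made up.
    import boto

    conn = boto.connect_s3()                      # creds from the environment
    bucket = conn.get_bucket('my-hadoop-bootstrap')

    for key_name in ('hadoop-0.18.1.tar.gz', 'apache-ant-1.7.1.tar.gz',
                     'our-app-current.jar'):
        key = bucket.get_key(key_name)
        key.get_contents_to_filename('/mnt/%s' % key_name)
        # ...then untar / symlink from the instance's init script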

We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send notification when
it completes, and allow the local server to initiate download of
results.  Overall, that minimizes the need for having a sysadmin
dedicated to the Hadoop jobs -- a small dev team can handle it, while
focusing on algorithm development and testing.


  Or on EC2 and its competitors, just build a new image whenever you
 need to update Hadoop itself.


Re: Hadoop (0.18) Spill Failed, out of Heap Space error

2008-09-03 Thread Paco NATHAN
Also, that almost always happens early in the map phase of the first
MR job which runs on our cluster.

Hadoop 0.18.1 on EC2 m1.xl instances.

We run 10 MR jobs in sequence, about 6 hours total, and haven't seen the
problem repeated after that one heap-space exception.

Paco


On Wed, Sep 3, 2008 at 11:42 AM, Florian Leibert [EMAIL PROTECTED] wrote:
 Hi,
 we're running 100 XLarge instances (ec2), with a gig of heap space for each
 task - and are seeing the following error frequently (but not always):
 # BEGIN PASTE #
 [exec] 08/09/03 11:21:09 INFO mapred.JobClient:  map 43% reduce 5%
 [exec] 08/09/03 11:21:16 INFO mapred.JobClient: Task Id :
 attempt_200809031101_0001_m_000220_0, Status : FAILED
 [exec] java.io.IOException: Spill failed
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:688)
 [exec] at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
 [exec] at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 [exec] Caused by: java.lang.OutOfMemoryError: Java heap space
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$InMemValBytes.reset(MapTask.java:928)
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getVBytesForOffset(MapTask.java:891)
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:765)
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:286)
 [exec] at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:712)
 # END #

 Has anyone seen this? Thanks,

 Florian Leibert
 Sr. Software Engineer
 Adknowledge Inc.



Re: question on fault tolerance

2008-08-11 Thread Paco NATHAN
just a guess,
for a long-running sequence of MR jobs, how's the namenode behaving
during that time? if it gets corrupted, one might see that behavior.

we have a similar situation, with 9 MR jobs back-to-back, taking much
of the day.

might be good to add some notification to an external process after
the end of each of those 3 MR jobs.

paco


On Mon, Aug 11, 2008 at 12:34 PM, Mori Bellamy [EMAIL PROTECTED] wrote:
 hey all,
 i have a job consisting of three MR jobs back to back to back. each job
 takes an appreciable percent of a day to complete (30% to 70%). even though
 i execute these jobs in a blocking fashion:


Re: iterative map-reduce

2008-07-29 Thread Paco NATHAN
A simple example of Hadoop application code which follows that pattern
(iterate until a condition is met) is in the jyte section here:

   http://code.google.com/p/ceteri-mapred/

Loop and condition test are in the same code which calls ToolRunner
and JobClient.

Best,
Paco


On Tue, Jul 29, 2008 at 10:03 AM, Christian Ulrik Søttrup
[EMAIL PROTECTED] wrote:
 Hi Shirley,

 I am basically doing as Qin suggested.
 I am running a job iteratively until some condition is met.
 My main looks something like this (in pseudocode):

 main:
 while (!converged):
  make new jobconf
  setup jobconf
  run jobconf
  check reporter for statistics
  decide if converged

 I use a custom reporter to check on the fitness of the solution in the
 reduce phase.

 If you need more(real java) code drop me a line.

 Cheers,
 Christian


Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN
Thank you, Amareshwari -

That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

What is a typical usage?  In other words, what would be the
outputDir value in the context of ToolRunner, JobClient, etc. ?

Paco


On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
[EMAIL PROTECTED] wrote:
 Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it
 makes sense?

 Thanks
 Amareshwari

 Paco NATHAN wrote:

 We have a need to access data found in the JobTracker History link.
 Specifically in the Analyse This Job analysis. Must be run in Java,
 between jobs, in the same code which calls ToolRunner and JobClient.
 In essence, we need to collect descriptive statistics about task
 counts and times for map, shuffle, reduce.

 After tracing the flow of the JSP in src/webapps/job...  Is there a
 better way to get at this data, *not* from the web UI perspective but
 from the code?

 Tried to find any applicable patterns in JobTracker, ClusterStatus,
 JobClient, etc., but no joy.

 Thanks,
 Paco





Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN
Thanks Amareshwari -

That could be quite useful to access summary analysis from within the code.

Currently this is not written as a public class, which makes it
difficult to use inside application code.

Are there plans to make it a public class?


Paco


On Mon, Jul 28, 2008 at 1:42 AM, Amareshwari Sriramadasu
[EMAIL PROTECTED] wrote:
 HistoryViewer is used in JobClient to view the history files in the
 directory provided on the command line. The command is
 $ bin/hadoop job -history <history-dir>   # by default, history is stored in the
 output dir.
 outputDir in the constructor of HistoryViewer is the directory passed on the
 command-line.

 You can specify a location to store the history files of a particular job
 using hadoop.job.history.user.location. If nothing is specified, the logs
 are stored in the job's
 output directory i.e. mapred.output.dir. The files are stored in
 _logs/history/ inside the directory.
 Thanks
 Amareshwari

 Paco NATHAN wrote:

 Thank you, Amareshwari -

 That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

 What is a typical usage?  In other words, what would be the
 outputDir value in the context of ToolRunner, JobClient, etc. ?

 Paco


 On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
 [EMAIL PROTECTED] wrote:


 Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if
 it
 make sense?

 Thanks
 Amareshwari

 Paco NATHAN wrote:


 We have a need to access data found in the JobTracker History link.
 Specifically in the Analyse This Job analysis. Must be run in Java,
 between jobs, in the same code which calls ToolRunner and JobClient.
 In essence, we need to collect descriptive statistics about task
 counts and times for map, shuffle, reduce.

 After tracing the flow of the JSP in src/webapps/job...  Is there a
 better way to get at this data, *not* from the web UI perspective but
 from the code?

 Tried to find any applicable patterns in JobTracker, ClusterStatus,
 JobClient, etc., but no joy.

 Thanks,
 Paco








JobTracker History data+analysis

2008-07-27 Thread Paco NATHAN
We have a need to access the data found in the JobTracker History link,
specifically in the Analyse This Job analysis. It must run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco


Re: Using MapReduce to do table comparing.

2008-07-23 Thread Paco NATHAN
This is merely an in-the-ballpark calculation, regarding that
10-minute / 4-node requirement...

We have a reasonably similar Hadoop job (slightly more complex in the
reduce phase) running on AWS with:

   * 100+2 nodes (m1.xl config)
   * approx 3x the number of rows and data size
   * completes in 6 minutes

You have faster processors.

So it might require something on the order of 25-35 nodes for 10 min
completion. That's a very rough estimate.

Those other two steps (deletes, inserts) might be performed in the
same pass as the compares -- and potentially quicker overall, when you
consider the time to load and recreate in the RDBMS.

Paco


On Wed, Jul 23, 2008 at 9:33 AM, Amber [EMAIL PROTECTED] wrote:
 We have a 10 million row table exported from an AS400 mainframe every day. The
 table is exported as a csv text file, which is about 30GB in size; the csv
 file is then imported into an RDBMS table which is dropped and recreated every
 day. Now we want to find how many rows are updated during each export-import
 interval. The table has a primary key, so deletes and inserts can be found
 quickly using RDBMS joins, but we must do a column-to-column comparison in
 order to find the difference between rows (about 90%) with the same primary
 keys. Our goal is to find a comparison process which takes no more than 10
 minutes on a 4-node cluster, where each server has four 4-core 3.0 GHz
 CPUs, 8GB memory, and a 300GB local RAID5 array.


Re: Large Weblink Graph

2008-04-15 Thread Paco NATHAN
Another site which has data sets available for study is UCI Machine
Learning Repository:
   http://archive.ics.uci.edu/ml/


On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote:

  Does anyone have a large Weblink graph? I want to experiment and benchmark
  MapReduce with a real dataset.


Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Paco NATHAN
Hi Stephen,

Here's a sample Hadoop app which has its build based on Ant:
   http://code.google.com/p/ceteri-mapred/

Look in the jyte directory.  A target called prep.jar simply uses
the <jar/> task in Ant to build a JAR for Hadoop to use.

Yeah, I agree that docs and discussions seem to lean more toward
systems engineering and are sparse about writing applications. A few
folks are trying to change that :)



On Fri, Mar 21, 2008 at 6:35 PM, Stephen J. Barr
[EMAIL PROTECTED] wrote:
 Hello,

  I am working on developing my first hadoop app from scratch. It is a
  Monte-Carlo simulation, and I am using the PiEstimator code from the
  examples as a reference. I believe I have what I want in a .java file.
  However, I couldn't find any documentation on how to make that .java
  file into a .jar that I could run, and I haven't found much
  documentation that is hadoop specific.

  Is it basically javac MyApp.java
  jar -cf MyApp

  or something to that effect, or is there more to it?

  Thanks! Sorry for the newbie question.

  -stephen barr




Re: Add your project or company to the powered by page?

2008-02-21 Thread Paco NATHAN
More on the subject of outreach, not specific uses at companies, but...
A couple things might help get the word out:

   - Add a community group in LinkedIn (shows up on profile searches)
 http://www.linkedin.com/static?key=groups_faq

   - Add a link on the wiki to the Facebook group about Hadoop
http://www.facebook.com/pages/Hadoop/9887781514

There's also a small but growing network of local user groups for
Amazon AWS, and much interest there for presentations and discussions
about Hadoop:
   
http://www.amazon.com/Upcoming-Events-AWS-home-page/b/ref=sc_fe_c_0_371080011_1/103-5668663-1566203?ie=UTF8node=16284451no=371080011me=A36L942TSJ2AJA


I'd be happy to help with any of those.
Paco



On Wed, Feb 20, 2008 at 10:26 PM, Eric Baldeschwieler
[EMAIL PROTECTED] wrote:
 Hi Folks,

  Let's get the word out that Hadoop is being used and is useful in
  your organizations, ok?  Please add yourselves to the Hadoop powered
  by page, or reply to this email with what details you would like to
  add and I'll do it.

  http://wiki.apache.org/hadoop/PoweredBy

  Thanks!

  E14

  ---
  eric14 a.k.a. Eric Baldeschwieler
  senior director, grid computing
  Yahoo!  Inc.